All rights reserved. This book or any portion thereof may not be reproduced or used in any manner whatsoever without the express written permission of the publisher, except for the use of brief quotations in a book review.
About the author:
Alex Xu is an experienced software engineer and entrepreneur. Previously, he worked at Twitter, Apple, and Zynga. He received his M.S. from Carnegie Mellon University. He has a passion for designing and implementing complex systems.
For more information, contact systemdesigninsider@gmail.com
Editor: Paul Solomon
Join the Email List
We are getting close to finishing more than 10 real-world system design interview questions. Please subscribe to our email list if you want to be notified when new chapters are available: http://bit.ly/systemmail
Join the community
I created a member-only Discord group. It is designed for community discussions on the following topics:
•System design fundamentals.
•Showcase design diagrams and get feedback.
•Find mock interview buddies.
•General chat with community members.
Come join us and introduce yourself to the community today by clicking the link below or scanning the barcode.
Table of Contents
System Design Interview: An Insider’s Guide
CHAPTER 1: SCALE FROM ZERO TO MILLIONS OF USERS
CHAPTER 2: BACK-OF-THE-ENVELOPE ESTIMATION
CHAPTER 3: A FRAMEWORK FOR SYSTEM DESIGN INTERVIEWS
CHAPTER 4: DESIGN A RATE LIMITER
CHAPTER 5: DESIGN CONSISTENT HASHING
CHAPTER 6: DESIGN A KEY-VALUE STORE
CHAPTER 7: DESIGN A UNIQUE ID GENERATOR IN DISTRIBUTED SYSTEMS
CHAPTER 8: DESIGN A URL SHORTENER
CHAPTER 9: DESIGN A WEB CRAWLER
CHAPTER 10: DESIGN A NOTIFICATION SYSTEM
CHAPTER 11: DESIGN A NEWS FEED SYSTEM
CHAPTER 12: DESIGN A CHAT SYSTEM
CHAPTER 13: DESIGN A SEARCH AUTOCOMPLETE SYSTEM
CHAPTER 15: DESIGN GOOGLE DRIVE
CHAPTER 16: THE LEARNING CONTINUES
We are delighted that you have decided to join us in learning about system design interviews. System design questions are the most difficult to tackle among all technical interview questions. They require interviewees to design an architecture for a software system, which could be a news feed, Google search, a chat system, etc. These questions are intimidating, and there is no set pattern to follow. They are usually broad in scope and vague, and the process is open-ended and unclear, with no standard or correct answer.
Companies widely adopt system design interviews because the communication and problem-solving skills tested in these interviews are similar to those required in a software engineer's daily work. An interviewee is evaluated on how she analyzes a vague problem and solves it step by step. The abilities tested also include how she explains her ideas, discusses them with others, and evaluates and optimizes the system. In English, using "she" flows better than "he or she" or jumping between the two. To make reading easier, we use the feminine pronoun throughout this book. No disrespect is intended for male engineers.
System design questions are open-ended. Just like in the real world, there are many variations of any given system. The desired outcome is to come up with an architecture that achieves the system design goals. The discussion could go in different directions depending on the interviewer. Some interviewers may choose a high-level architecture that covers all aspects, whereas others might pick one or more areas to focus on. Typically, the system requirements, constraints, and bottlenecks should be well understood to shape the direction for both the interviewer and the interviewee.
The objective of this book is to provide a reliable strategy for approaching system design questions. The right strategy and knowledge are vital to the success of an interview.
This book provides solid knowledge of building scalable systems. The more knowledge you gain from reading this book, the better equipped you are to solve system design questions.
This book also provides a step-by-step framework for tackling a system design question. It includes many examples that illustrate the systematic approach, with detailed steps that you can follow. With constant practice, you will be well equipped to tackle system design interview questions.
Designing a system that supports millions of users is challenging, and it is a journey that requires continuous refinement and endless improvement. In this chapter, we build a system that supports a single user and gradually scale it up to serve millions of users. After reading this chapter, you will master a handful of techniques that will help you crack system design interview questions.
A journey of a thousand miles begins with a single step, and building a complex system is no different. To start with something simple, everything runs on a single server. Figure 1-1 shows the illustration of a single-server setup where everything is running on one server: web app, database, cache, etc.
To understand this setup, it is helpful to investigate the request flow and traffic source. Let us first look at the request flow (Figure 1-2).
1. Users access websites through domain names, such as api.mysite.com. Usually, the Domain Name System (DNS) is a paid service provided by third parties and not hosted by our servers.
2. An Internet Protocol (IP) address is returned to the browser or mobile app. In the example, the IP address 15.125.23.214 is returned.
3. Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) [1] requests are sent directly to your web server.
4. The web server returns HTML pages or a JSON response for rendering.
Next, let us examine the traffic source. The traffic to your web server comes from two sources: web application and mobile application.
•Web application: it uses a combination of server-side languages (Java, Python, etc.) to handle business logic, storage, etc., and client-side languages (HTML and JavaScript) for presentation.
•Mobile application: HTTP is the communication protocol between the mobile app and the web server. JavaScript Object Notation (JSON) is a commonly used API response format for transferring data due to its simplicity. An example of an API response in JSON format is shown below:
GET /users/12 – Retrieve the user object for id = 12
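The response body itself did not survive in the text above; a plausible JSON payload for this request might look like the following (the field names and values are illustrative, not taken from the original):

```json
{
  "id": 12,
  "firstName": "John",
  "lastName": "Smith",
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": 10021
  },
  "phoneNumbers": [
    "212 555-1234",
    "646 555-4567"
  ]
}
```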
With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database (Figure 1-3). Separating the web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.
You can choose between a traditional relational database and a non-relational database. Let us examine their differences.
Relational databases are also called relational database management systems (RDBMS) or SQL databases. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases represent and store data in tables and rows. You can perform join operations using SQL across different database tables.
Non-relational databases are also called NoSQL databases. Popular ones are CouchDB, Neo4j, Cassandra, HBase, Amazon DynamoDB, etc. [2]. These databases are grouped into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases.
For most developers, relational databases are the best option because they have been around for over 40 years and, historically, they have worked well. However, if relational databases are not suitable for your specific use cases, it is critical to explore beyond them. Non-relational databases might be the right choice if:
•Your application requires super-low latency.
•Your data is unstructured, or you do not have any relational data.
•You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
Vertical scaling, referred to as "scale up", means the process of adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as "scale-out", allows you to scale by adding more servers to your pool of resources.
When traffic is low, vertical scaling is a great option, and its simplicity is its main advantage. Unfortunately, it comes with serious limitations.
•Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
•Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.
Horizontal scaling is more desirable for large-scale applications due to the limitations of vertical scaling.
In the previous design, users connected to the web server directly. Users will be unable to access the website if the web server is offline. In another scenario, if many users access the web server simultaneously and it reaches the web server's load limit, users generally experience slower responses or fail to connect to the server. A load balancer is the best technique to address these problems.
A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set. Figure 1-4 shows how a load balancer works.
As shown in Figure 1-4, users connect to the public IP of the load balancer directly. With this setup, web servers are no longer reachable directly by clients. For better security, private IPs are used for communication between servers. A private IP is an IP address reachable only between servers in the same network; it is unreachable over the internet. The load balancer communicates with web servers through private IPs.
In Figure 1-4, after a load balancer and a second web server are added, we successfully solved the no-failover issue and improved the availability of the web tier. Details are explained below:
•If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We will also add a new healthy web server to the server pool to balance the load.
•If the website traffic grows rapidly and two servers are not enough to handle the traffic, the load balancer can handle this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.
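The even-distribution idea can be sketched as a simple round-robin selector over a mutable server pool. This is a minimal sketch, not a real load balancer; the IP addresses are made up for illustration:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distributes requests evenly across a pool of backend servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = cycle(self.servers)

    def next_server(self):
        # Each call returns the next server in the pool, wrapping around.
        return next(self._cycle)

    def add_server(self, server):
        # A newly added server starts receiving traffic with no client changes.
        self.servers.append(server)
        self._cycle = cycle(self.servers)

balancer = RoundRobinBalancer(["10.0.0.1", "10.0.0.2"])
picks = [balancer.next_server() for _ in range(4)]
print(picks)  # ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.2']
```

Adding a third server via `add_server` would immediately put it into the rotation, which mirrors how the load balancer "automatically starts to send requests" to new pool members.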
Now the web tier looks good, but what about the data tier? The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address those problems. Let us take a look.
Quoted from Wikipedia: "Database replication can be used in many database management systems, usually with a master/slave relationship between the original (master) and the copies (slaves)" [3].
A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All the data-modifying commands like insert, delete, or update must be sent to the master database. Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases. Figure 1-5 shows a master database with multiple slave databases.
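The read/write split described above can be sketched as a small router that sends data-modifying statements to the master and spreads reads round-robin across the slaves. The connection names below are plain strings standing in for real database connections, not a real driver API:

```python
import itertools

class ReplicatedRouter:
    """Routes writes to the master and reads round-robin across slaves."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE")

    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def route(self, sql):
        verb = sql.strip().split()[0].upper()
        if verb in self.WRITE_VERBS:
            return self.master      # all data-modifying commands go to the master
        return next(self._slaves)   # reads are spread over the slave replicas

router = ReplicatedRouter("master-db", ["slave-1", "slave-2"])
print(router.route("INSERT INTO users VALUES (12)"))  # master-db
print(router.route("SELECT * FROM users"))            # slave-1
print(router.route("SELECT * FROM users"))            # slave-2
```

Because most workloads are read-heavy, adding more slaves to the cycle increases read throughput without touching the write path.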
Advantages of database replication:
•Better performance: In the master-slave model, all writes and updates happen in master nodes, whereas read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
•Reliability: If one of your database servers is destroyed by a natural disaster, such as a typhoon or an earthquake, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
•High availability: By replicating data across different locations, your website remains in operation even if a database is offline, as you can access data stored in another database server.
In the previous section, we discussed how a load balancer helped to improve system availability. We ask the same question here: what if one of the databases goes offline? The architectural design discussed in Figure 1-5 can handle this case:
•If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. If multiple slave databases are available, read operations are redirected to other healthy slave databases, and a new database server will replace the old one.
•If the master database goes offline, a slave database will be promoted to be the new master. All the database operations will be temporarily executed on the new master database. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complicated, as the data in a slave database might not be up to date. The missing data needs to be updated by running data recovery scripts. Although some other replication methods like multi-master and circular replication could help, those setups are more complicated; their discussion is beyond the scope of this book. Interested readers should refer to the listed reference materials [4] [5].
Figure 1-6 shows the system design after adding the load balancer and database replication.
Let us take a look at the design:
•A user gets the IP address of the load balancer from DNS.
•A user connects to the load balancer with this IP address.
•The HTTP request is routed to either Server 1 or Server 2.
•A web server reads user data from a slave database.
•A web server routes any data-modifying operations to the master database. This includes write, update, and delete operations.
Now that you have a solid understanding of the web and data tiers, it is time to improve the load/response time. This can be done by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to a content delivery network (CDN).
A cache is a temporary storage area that stores the results of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly. As illustrated in Figure 1-6, every time a new web page loads, one or more database calls are executed to fetch data. Application performance is greatly affected by calling the database repeatedly. A cache can mitigate this problem.
The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, the ability to reduce database workloads, and the ability to scale the cache tier independently. Figure 1-7 shows a possible setup of a cache server:
After receiving a request, a web server first checks if the cache has the available response. If it does, it sends the data back to the client. If not, it queries the database, stores the response in the cache, and sends it back to the client. This caching strategy is called a read-through cache. Other caching strategies are available depending on the data type, size, and access patterns. A previous study explains how different caching strategies work [6].
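The check-then-populate flow described above can be sketched as follows; `slow_db_query` is a stand-in for a real database call, and a plain dict stands in for the cache server:

```python
cache = {}

def slow_db_query(key):
    # Stand-in for an expensive database call.
    return f"row-for-{key}"

def get_with_cache(key):
    """Serve from the cache if possible; otherwise query the DB and cache the result."""
    if key in cache:
        return cache[key]          # cache hit: no database call needed
    value = slow_db_query(key)     # cache miss: fall back to the database
    cache[key] = value             # store the response for subsequent requests
    return value

print(get_with_cache("user:12"))  # first call misses and queries the database
print(get_with_cache("user:12"))  # second call is served from memory
```

The first request pays the database cost; every later request for the same key is answered from memory, which is exactly the speedup the cache tier provides.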
Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows a typical Memcached API:
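The snippet itself is missing from the text above. The typical interaction is a set with an expiry followed by a get; the sketch below uses a tiny in-memory stand-in for the client object, since real Memcached clients (e.g. pymemcache) expose the same set/get style of calls:

```python
class FakeMemcachedClient:
    """A tiny in-memory stand-in for a Memcached client."""

    def __init__(self):
        self._store = {}

    def set(self, key, value, expire=0):
        # 'expire' (seconds) is accepted but ignored in this stand-in.
        self._store[key] = value

    def get(self, key):
        # Returns None for missing keys, like a real cache client.
        return self._store.get(key)

SECONDS = 1
cache = FakeMemcachedClient()
cache.set('myKey', 'hi there', 3600 * SECONDS)  # cache the value for an hour
print(cache.get('myKey'))  # hi there
```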
Here are a few considerations for using a cache system:
•Decide when to use a cache. Consider using a cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.
•Expiration policy. It is a good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. When there is no expiration policy, cached data will be stored in memory permanently. It is advisable not to make the expiration date too short, as this will cause the system to reload data from the database too frequently. Meanwhile, it is advisable not to make the expiration date too long, as the data can become stale.
•Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and the cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and the cache is challenging. For further details, refer to the paper titled "Scaling Memcache at Facebook" published by Facebook [7].
•Mitigating failures: A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: "A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working" [8]. As a result, multiple cache servers across different data centers are recommended to avoid a SPOF. Another recommended approach is to overprovision the required memory by a certain percentage. This provides a buffer as memory usage increases.
•Eviction policy: Once the cache is full, any request to add items to the cache might cause existing items to be removed. This is called cache eviction. Least recently used (LRU) is the most popular cache eviction policy. Other eviction policies, such as least frequently used (LFU) or first in, first out (FIFO), can be adopted to satisfy different use cases.
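To make the LRU eviction policy concrete, here is a minimal sketch built on `collections.OrderedDict`, which keeps insertion order and lets us move a key to the "most recently used" end on each access:

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark the key as most recently used
        return self._data[key]

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now the most recently used entry
cache.put("c", 3)    # cache is full: "b" is evicted, not "a"
print(cache.get("b"), cache.get("a"))  # None 1
```

An LFU or FIFO policy would change only the bookkeeping in `get`/`put`, which is why eviction policies are usually pluggable in real cache servers.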
A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc.
Dynamic content caching is a relatively new concept and beyond the scope of this book. It enables the caching of HTML pages based on request paths, query strings, cookies, and request headers. Refer to the article mentioned in reference material [9] for more about this. This book focuses on how to use a CDN to cache static content.
Here is how a CDN works at a high level: when a user visits a website, the CDN server closest to the user will deliver the static content. Intuitively, the further users are from CDN servers, the slower the website loads. For example, if CDN servers are in San Francisco, users in Los Angeles will get content faster than users in Europe. Figure 1-9 is a great example that shows how a CDN improves load time.
Figure 1-10 demonstrates the CDN workflow.
1. User A tries to get image.png by using an image URL. The URL's domain is provided by the CDN provider. The following two image URLs are samples that demonstrate what image URLs look like on Amazon and Akamai CDNs:
•https://mysite.cloudfront.net/logo.jpg
•https://mysite.akamai.com/image-manager/img/logo.jpg
2. If the CDN server does not have image.png in the cache, the CDN server requests the file from the origin, which can be a web server or online storage like Amazon S3.
3. The origin returns image.png to the CDN server, including an optional HTTP header Time-to-Live (TTL) that describes how long the image should be cached.
4. The CDN caches the image and returns it to User A. The image remains cached in the CDN until the TTL expires.
5. User B sends a request to get the same image.
6. The image is returned from the cache as long as the TTL has not expired.
•Cost: CDNs are run by third-party providers, and you are charged for data transfers in and out of the CDN. Caching infrequently used assets provides no significant benefit, so you should consider moving them out of the CDN.
•Setting an appropriate cache expiry: For time-sensitive content, setting a cache expiry time is important. The cache expiry time should be neither too long nor too short. If it is too long, the content might no longer be fresh. If it is too short, it can cause repeated reloading of content from origin servers to the CDN.
•CDN fallback: You should consider how your website/application copes with CDN failure. If there is a temporary CDN outage, clients should be able to detect the problem and request resources from the origin.
•Invalidating files: You can remove a file from the CDN before it expires by performing one of the following operations:
•Invalidate the CDN object using APIs provided by CDN vendors.
•Use object versioning to serve a different version of the object. To version an object, you can add a parameter to the URL, such as a version number. For example, version number 2 is added to the query string: image.png?v=2.
Figure 1-11 shows the design after the CDN and cache are added.
1. Static assets (JS, CSS, images, etc.) are no longer served by web servers. They are fetched from the CDN for better performance.
2. The database load is lightened by caching data.
Now it is time to consider scaling the web tier horizontally. For this, we need to move state (for instance, user session data) out of the web tier. A good practice is to store session data in persistent storage such as a relational database or a NoSQL store. Each web server in the cluster can then access the state data from the databases. This is called a stateless web tier.
A stateful server and a stateless server have some key differences. A stateful server remembers client data (state) from one request to the next. A stateless server keeps no state information.
Figure 1-12 shows an example of a stateful architecture.
In Figure 1-12, user A's session data and profile image are stored in Server 1. To authenticate User A, HTTP requests must be routed to Server 1. If a request is sent to another server like Server 2, authentication will fail because Server 2 does not contain User A's session data. Similarly, all HTTP requests from User B must be routed to Server 2, and all requests from User C must be sent to Server 3.
The issue is that every request from the same client must be routed to the same server. This can be done with sticky sessions in most load balancers [10]; however, this adds overhead. Adding or removing servers is much more difficult with this approach. It is also challenging to handle server failures.
Figure 1-13 shows the stateless architecture.
In this stateless architecture, HTTP requests from users can be sent to any web server, which fetches state data from a shared data store. State data is stored in a shared data store and kept out of the web servers. A stateless system is simpler, more robust, and more scalable.
Figure 1-14 shows the updated design with a stateless web tier.
In Figure 1-14, we move the session data out of the web tier and store it in the persistent data store. The shared data store could be a relational database, Memcached/Redis, a NoSQL store, etc. The NoSQL data store is chosen here because it is easy to scale. Autoscaling means adding or removing web servers automatically based on the traffic load. After the state data is removed from the web servers, auto-scaling of the web tier is easily achieved by adding or removing servers based on the traffic load.
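The stateless idea can be sketched as follows: any server instance can handle any request, because sessions live in a shared store rather than in server memory. A plain dict stands in for the shared store (Redis, Memcached, or a NoSQL table in practice), and the server/user names are illustrative:

```python
# A dict stands in for the shared session store (e.g. Redis or a NoSQL table).
shared_session_store = {}

class WebServer:
    """Stateless web server: keeps no per-client state of its own."""

    def __init__(self, name, session_store):
        self.name = name
        self.sessions = session_store   # shared across all servers

    def handle_request(self, session_id):
        user = self.sessions.get(session_id)
        if user is None:
            return f"{self.name}: 401 unauthenticated"
        return f"{self.name}: 200 hello {user}"

server1 = WebServer("server-1", shared_session_store)
server2 = WebServer("server-2", shared_session_store)

# A login handled by server-1 writes the session to the shared store...
shared_session_store["sess-abc"] = "userA"

# ...so a later request from the same client can land on ANY server.
print(server1.handle_request("sess-abc"))  # server-1: 200 hello userA
print(server2.handle_request("sess-abc"))  # server-2: 200 hello userA
```

Because no server owns the session, servers can be added or removed freely, which is what makes autoscaling of the web tier straightforward.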
Your website grows rapidly and attracts a significant number of users internationally. To improve availability and provide a better user experience across wider geographical areas, supporting multiple data centers is crucial.
Figure 1-15 shows an example setup with two data centers. In normal operation, users are geoDNS-routed, also known as geo-routed, to the closest data center, with traffic split x% to US-East and (100 – x)% to US-West. geoDNS is a DNS service that allows domain names to be resolved to IP addresses based on the location of a user.
In the event of any significant data center outage, we direct all traffic to a healthy data center. In Figure 1-16, data center 2 (US-West) is offline, and 100% of the traffic is routed to data center 1 (US-East).
Several technical challenges must be resolved to achieve a multi-data center setup:
•Traffic redirection: Effective tools are needed to direct traffic to the correct data center. GeoDNS can be used to direct traffic to the nearest data center depending on where a user is located.
•Data synchronization: Users from different regions could use different local databases or caches. In failover cases, traffic might be routed to a data center where data is unavailable. A common strategy is to replicate data across multiple data centers. A previous study shows how Netflix implements asynchronous multi-data center replication [11].
•Test and deployment: With a multi-data center setup, it is important to test your website/application at different locations. Automated deployment tools are vital to keep services consistent across all the data centers [11].
To further scale our system, we need to decouple different components of the system so they can be scaled independently. A message queue is a key strategy employed by many real-world distributed systems to solve this problem.
A message queue is a durable component, stored in memory, that supports asynchronous communication. It serves as a buffer and distributes asynchronous requests. The basic architecture of a message queue is simple. Input services, called producers/publishers, create messages and publish them to a message queue. Other services or servers, called consumers/subscribers, connect to the queue and perform actions defined by the messages. The model is shown in Figure 1-17.
Decoupling makes the message queue a preferred architecture for building scalable and reliable applications. With a message queue, the producer can post a message to the queue even when the consumer is unavailable to process it, and the consumer can read messages from the queue even when the producer is unavailable.
Consider the following use case: your application supports photo customization, including cropping, sharpening, blurring, etc. Those customization tasks take time to complete. In Figure 1-18, web servers publish photo processing jobs to the message queue. Photo processing workers pick up jobs from the message queue and asynchronously perform photo customization tasks. The producer and the consumer can be scaled independently. When the size of the queue becomes large, more workers are added to reduce the processing time. However, if the queue is empty most of the time, the number of workers can be reduced.
When working with a small website that runs on a few servers, logging, metrics, and automation support are good practices but not a necessity. However, now that your site has grown to serve a large business, investing in those tools is essential.
Logging: Monitoring error logs is important because it helps to identify errors and problems in the system. You can monitor error logs at the per-server level or use tools to aggregate them into a centralized service for easy search and viewing.
Metrics: Collecting different types of metrics helps us gain business insights and understand the health status of the system. Some useful metrics include:
•Host level metrics: CPU, memory, disk I/O, etc.
•Aggregated level metrics: for example, the performance of the entire database tier, cache tier, etc.
•Key business metrics: daily active users, retention, revenue, etc.
Automation: When a system gets big and complex, we need to build or leverage automation tools to improve productivity. Continuous integration is a good practice, in which each code check-in is verified through automation, allowing teams to detect problems early. Besides, automating the build, test, and deploy processes could improve developer productivity significantly.
Adding message queues and different tools
Figure 1-19 shows the updated design. Due to space constraints, only one data center is shown in the figure.
1. The design includes a message queue, which helps to make the system more loosely coupled and failure resilient.
2. Logging, monitoring, metrics, and automation tools are included.
As the data grows every day, your database gets more overloaded. It is time to scale the data tier.
There are two broad approaches for database scaling: vertical scaling and horizontal scaling.
Vertical scaling, also known as scaling up, means scaling by adding more power (CPU, RAM, disk, etc.) to an existing machine. There are some powerful database servers. According to Amazon Relational Database Service (RDS) [12], you can get a database server with 24 TB of RAM. This kind of powerful database server can store and handle lots of data. For example, stackoverflow.com in 2013 had over 10 million monthly unique visitors, but it only had 1 master database [13]. However, vertical scaling comes with some serious drawbacks:
•You can add more CPU, RAM, etc. to your database server, but there are hardware limits. If you have a large user base, a single server is not enough.
•Greater risk of a single point of failure.
•The overall cost of vertical scaling is high. Powerful servers are much more expensive.
Horizontal scaling, also known as sharding, is the practice of adding more servers. Figure 1-20 compares vertical scaling with horizontal scaling.
Sharding separates large databases into smaller, more easily managed parts called shards. Each shard shares the same schema, though the actual data on each shard is unique to the shard.
Figure 1-21 shows an example of sharded databases. User data is allocated to a database server based on user IDs. Anytime you access data, a hash function is used to find the corresponding shard. In our example, user_id % 4 is used as the hash function. If the result equals 0, shard 0 is used to store and fetch data. If the result equals 1, shard 1 is used. The same logic applies to other shards.
Figure 1-22 shows the user table in sharded databases.
The most important factor to consider when implementing a sharding strategy is the choice of the sharding key. A sharding key (also known as a partition key) consists of one or more columns that determine how data is distributed. As shown in Figure 1-22, "user_id" is the sharding key. A sharding key allows you to retrieve and modify data efficiently by routing database queries to the correct database. When choosing a sharding key, one of the most important criteria is to choose a key that can evenly distribute data.
Sharding is a great technique to scale the database, but it is far from a perfect solution. It introduces complexities and new challenges to the system:
Resharding data: Resharding is needed when 1) a single shard can no longer hold more data due to rapid growth, or 2) certain shards experience shard exhaustion faster than others due to uneven data distribution. When shard exhaustion happens, it requires updating the sharding function and moving data around. Consistent hashing, which will be discussed in Chapter 5, is a commonly used technique to solve this problem.
Celebrity problem: This is also called the hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. For social applications, that shard would be overwhelmed with read operations. To solve this problem, we may need to allocate a shard for each celebrity. Each shard might even require further partitioning.
Join and de-normalization: Once a database has been sharded across multiple servers, it is hard to perform join operations across database shards. A common workaround is to de-normalize the database so that queries can be performed in a single table.
In Figure 1-23, we shard databases to support rapidly increasing data traffic. At the same time, some of the non-relational functionalities are moved to a NoSQL data store to reduce the database load. Here is an article that covers many use cases of NoSQL [14].
Scaling a system is an iterative process. Iterating on what we have learned in this chapter could get us far. More fine-tuning and new strategies are needed to scale beyond millions of users. For example, you might need to optimize your system and decouple it into even smaller services. All the techniques learned in this chapter should provide a good foundation to tackle new challenges. To conclude this chapter, here is a summary of how we scale our system to support millions of users:
•Keep web tier stateless
•Build redundancy at every tier
•Cache data as much as you can
•Support multiple data centers
•Host static assets in CDN
•Scale your data tier by sharding
•Split tiers into individual services
•Monitor your system and use automation tools
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Hypertext Transfer Protocol: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[2] Should you go Beyond Relational Databases?: https://blog.teamtreehouse.com/should-you-go-beyond-relational-databases
[3] Replication: https://en.wikipedia.org/wiki/Replication_(computing)
[4] Multi-master replication: https://en.wikipedia.org/wiki/Multi-master_replication
[5] NDB Cluster Replication: Multi-Master and Circular Replication: https://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-replication-multi-master.html
[6] Caching Strategies and How to Choose the Right One: https://codeahoy.com/2017/08/11/caching-strategies-and-how-to-choose-the-right-one/
[7] R. Nishtala et al., "Scaling Memcache at Facebook," 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13).
[8] Single point of failure: https://en.wikipedia.org/wiki/Single_point_of_failure
[9] Amazon CloudFront Dynamic Content Delivery: https://aws.amazon.com/cloudfront/dynamic-content/
[10] Configure Sticky Sessions for Your Classic Load Balancer: https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-sticky-sessions.html
[11] Active-Active for Multi-Regional Resiliency: https://netflixtechblog.com/active-active-for-multi-regional-resiliency-c47719f6685b
[12] Amazon EC2 High Memory Instances: https://aws.amazon.com/ec2/instance-types/high-memory/
[13] What it takes to run Stack Overflow: http://nickcraver.com/blog/2013/11/22/what-it-takes-to-run-stack-overflow
[14] What The Heck Are You Actually Using NoSQL For: http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
In a system design interview, sometimes you are asked to estimate system capacity or performance requirements using a back-of-the-envelope estimation. According to Jeff Dean, Google Senior Fellow, “back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to get a good feel for which designs will meet your requirements” [1].
You need a good sense of scalability basics to effectively carry out back-of-the-envelope estimation. The following concepts should be well understood: the power of two [2], latency numbers every programmer should know, and availability numbers.
Although data volume can become enormous when dealing with distributed systems, calculation all boils down to the basics. To obtain correct calculations, it is critical to know the data volume units based on powers of 2. A byte is a sequence of 8 bits. An ASCII character uses one byte of memory (8 bits). Table 2-1 explains the data volume units.
Dr. Dean from Google revealed the length of typical computer operations in 2010 [1]. Some numbers are outdated as computers have become faster and more powerful. However, those numbers should still give us an idea of the relative speed of different computer operations.
Notes
-----------
ns = nanosecond, µs = microsecond, ms = millisecond
1 ns = 10^-9 seconds
1 µs = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 µs = 1,000,000 ns
A Google software engineer built a tool to visualize Dr. Dean’s numbers. The tool also takes the time factor into consideration. Figure 2-1 shows the visualized latency numbers as of 2020 (source of figure: reference material [3]).
By analyzing the numbers in Figure 2-1, we get the following conclusions:
•Memory is fast but the disk is slow.
•Avoid disk seeks if possible.
•Simple compression algorithms are fast.
•Compress data before sending it over the internet if possible.
•Data centers are usually in different regions, and it takes time to send data between them.
High availability is the ability of a system to be continuously operational for a desirably long period of time. High availability is measured as a percentage, with 100% meaning a service that has zero downtime. Most services fall between 99% and 100%.
A service level agreement (SLA) is a commonly used term for service providers. This is an agreement between you (the service provider) and your customer that formally defines the level of uptime your service will deliver. Cloud providers Amazon [4], Google [5], and Microsoft [6] set their SLAs at 99.9% or above. Uptime is traditionally measured in nines: the more nines, the better. As shown in Table 2-3, the number of nines correlates with the expected system downtime.
Please note the following numbers are for this exercise only; they are not real numbers from Twitter.
Assumptions:
• 300 million monthly active users.
• 50% of users use Twitter daily.
• Users post 2 tweets per day on average.
• 10% of tweets contain media.
• Data is stored for 5 years.
Estimations:
Query per second (QPS) estimate:
• Daily active users (DAU) = 300 million * 50% = 150 million
• Tweets QPS = 150 million * 2 tweets / 24 hours / 3600 seconds = ~3500
• Peak QPS = 2 * QPS = ~7000
We will only estimate media storage here.
• Average tweet size:
• tweet_id 64 bytes
• text 140 bytes
• media 1 MB
• Media storage: 150 million * 2 * 10% * 1 MB = 30 TB per day
• 5-year media storage: 30 TB * 365 * 5 = ~55 PB
Back-of-the-envelope estimation is all about the process. Solving the problem is more important than obtaining results. Interviewers may test your problem-solving skills. Here are a few tips to follow:
•Rounding and approximation. It is difficult to perform complicated math operations during the interview. For example, what is the result of “99987 / 9.1”? There is no need to spend valuable time solving complicated math problems. Precision is not expected. Use round numbers and approximation to your advantage. The division question can be simplified to “100,000 / 10”.
•Write down your assumptions. It is a good idea to write down your assumptions so they can be referenced later.
•Label your units. When you write down “5”, does it mean 5 KB or 5 MB? You might confuse yourself with this. Write down the units: “5 MB” removes the ambiguity.
•Commonly asked back-of-the-envelope estimations: QPS, peak QPS, storage, cache, number of servers, etc. You can practice these calculations when preparing for an interview. Practice makes perfect.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] J. Dean. Google Pro Tip: Use Back-Of-The-Envelope Calculations To Choose The Best Design:
[2] System design primer: https://github.com/donnemartin/system-design-primer
[3] Latency Numbers Every Programmer Should Know: https://colin-scott.github.io/personal_website/research/interactive_latency.html
[4] Amazon Compute Service Level Agreement: https://aws.amazon.com/compute/sla/
[5] Compute Engine Service Level Agreement (SLA): https://cloud.google.com/compute/sla
[6] SLA summary for Azure services: https://azure.microsoft.com/en-us/support/legal/sla/summary/
You have just landed a coveted on-site interview at your dream company. The hiring coordinator sends you the schedule for that day. Scanning down the list, you feel pretty good about it until your eyes land on one session: the system design interview.
System design interviews are often intimidating. The prompt could be as vague as “design a well-known product X”. The questions are ambiguous and seem unreasonably broad. Your weariness is understandable. After all, how could anyone design, in an hour, a popular product that has taken hundreds if not thousands of engineers to build?
The good news is that no one expects you to. Real-world system design is extremely complicated. For example, Google search is deceptively simple; however, the amount of technology that underpins that simplicity is truly astonishing. If no one expects you to design a real-world system in an hour, what is the benefit of a system design interview?
The system design interview simulates real-life problem solving, where two co-workers collaborate on an ambiguous problem and come up with a solution that meets their goals. The problem is open-ended, and there is no perfect answer. The final design is less important than the work you put into the design process. This allows you to demonstrate your design skills, defend your design choices, and respond to feedback in a constructive manner.
Let us flip the table and consider what goes through the interviewer’s head as she walks into the conference room to meet you. The primary goal of the interviewer is to accurately assess your abilities. The last thing she wants is to give an inconclusive evaluation because the session went poorly and there were not enough signals. What is an interviewer looking for in a system design interview?
Many think that the system design interview is all about a person's technical design skills. It is much more than that. An effective system design interview gives strong signals about a person's ability to collaborate, to work under pressure, and to resolve ambiguity constructively. The ability to ask good questions is also an essential skill, and many interviewers specifically look for it.
A good interviewer also looks for red flags. Over-engineering is a real disease of many engineers, as they delight in design purity and ignore tradeoffs. They are often unaware of the compounding costs of over-engineered systems, and many companies pay a high price for that ignorance. You certainly do not want to demonstrate this tendency in a system design interview. Other red flags include narrow-mindedness, stubbornness, etc.
In this chapter, we will go over some useful tips and introduce a simple and effective framework for solving system design interview problems.
Every system design interview is different. A great system design interview is open-ended, and there is no one-size-fits-all solution. However, there are steps and common ground to cover in every system design interview.
"Why did the tiger roar?"
A hand shot up in the back of the class.
"Yes, Jimmy?", the teacher responded.
"Because he was HUNGRY."
"Very good, Jimmy."
Throughout his childhood, Jimmy has always been the first to answer questions in class. Whenever the teacher asks a question, there is always a kid in the classroom who loves to take a crack at it, whether or not he knows the answer. That is Jimmy.
Jimmy is an ace student. He takes pride in knowing all the answers fast. In exams, he is usually the first person to finish. He is the teacher's top choice for any academic competition.
DON'T be like Jimmy.
In a system design interview, giving out an answer quickly without thinking gains you no bonus points. Answering without a thorough understanding of the requirements is a huge red flag, as the interview is not a trivia contest. There is no right answer.
So, do not jump right in to give a solution. Slow down. Think deeply and ask questions to clarify requirements and assumptions. This is extremely important.
As engineers, we like to solve hard problems and jump into the final design; however, this approach is likely to lead you to design the wrong system. One of the most important skills for an engineer is to ask the right questions, make the proper assumptions, and gather all the information needed to build a system. So, do not be afraid to ask questions.
When you ask a question, the interviewer either answers your question directly or asks you to make your own assumptions. If the latter happens, write down your assumptions on the whiteboard or paper. You might need them later.
What kind of questions should you ask? Ask questions to understand the exact requirements. Here is a list of questions to help you get started:
•What specific features are we going to build?
•How many users does the product have?
•How fast does the company anticipate scaling up? What are the anticipated scales in 3 months, 6 months, and a year?
•What is the company’s technology stack? What existing services might you leverage to simplify the design?
If you are asked to design a news feed system, you want to ask questions that help you clarify the requirements. The conversation between you and the interviewer might look like this:
Candidate: Is this a mobile app? Or a web app? Or both?
Interviewer: Both.
Candidate: What are the most important features for the product?
Interviewer: Ability to make a post and see friends’ news feed.
Candidate: Is the news feed sorted in reverse chronological order or a particular order? A particular order means each post is given a different weight. For instance, posts from your close friends are more important than posts from a group.
Interviewer: To keep things simple, let us assume the feed is sorted in reverse chronological order.
Candidate: How many friends can a user have?
Interviewer: 5000
Candidate: What is the traffic volume?
Interviewer: 10 million daily active users (DAU)
Candidate: Can the feed contain images, videos, or just text?
Interviewer: It can contain media files, including both images and videos.
Above are some sample questions that you can ask your interviewer. It is important to understand the requirements and clarify ambiguities.
In this step, we aim to develop a high-level design and reach an agreement with the interviewer on it. It is a great idea to collaborate with the interviewer during the process.
•Come up with an initial blueprint for the design. Ask for feedback. Treat your interviewer as a teammate and work together. Many good interviewers love to talk and get involved.
•Draw box diagrams with key components on the whiteboard or paper. These might include clients (mobile/web), APIs, web servers, data stores, cache, CDN, message queue, etc.
•Do back-of-the-envelope calculations to evaluate if your blueprint fits the scale constraints. Think out loud. Communicate with your interviewer about whether back-of-the-envelope calculations are necessary before diving into them.
If possible, go through a few concrete use cases. This will help you frame the high-level design. The use cases may also help you discover edge cases you have not yet considered.
Should we include API endpoints and database schema here? It depends on the problem. For large design problems like “design a Google search engine”, this is a bit too low-level. For a problem like designing the backend for a multi-player poker game, it is fair game. Communicate with your interviewer.
Let us use “design a news feed system” to demonstrate how to approach the high-level design. Here you are not required to understand how the system actually works. All the details will be explained in Chapter 11.
At a high level, the design is divided into two flows: feed publishing and news feed building.
•Feed publishing: when a user publishes a post, corresponding data is written into the cache/database, and the post is populated into friends’ news feeds.
•News feed building: the news feed is built by aggregating friends’ posts in reverse chronological order.
Figure 3-1 and Figure 3-2 present the high-level designs for the feed publishing and news feed building flows, respectively.
At this step, you and your interviewer should have achieved the following objectives:
•Agreed on the overall goals and feature scope
•Sketched out a high-level blueprint for the overall design
•Obtained feedback from your interviewer on the high-level design
•Had some initial ideas about areas to focus on during the deep dive, based on her feedback
You shall work with the interviewer to identify and prioritize components in the architecture. It is worth stressing that every interview is different. Sometimes, the interviewer may give off hints that she likes focusing on high-level design. Sometimes, for a senior candidate interview, the discussion could be on the system performance characteristics, likely focusing on the bottlenecks and resource estimations. In most cases, the interviewer may want you to dig into details of some system components. For a URL shortener, it is interesting to dive into the hash function design that converts a long URL to a short one. For a chat system, how to reduce latency and how to support online/offline status are two interesting topics.
Time management is essential as it is easy to get carried away with minute details that do not demonstrate your abilities. You must be armed with signals to show your interviewer. Try not to get into unnecessary details. For example, talking about the EdgeRank algorithm of Facebook feed ranking in detail is not ideal during a system design interview as this takes much precious time and does not prove your ability in designing a scalable system.
At this point, we have discussed the high-level design for a news feed system, and the interviewer is happy with your proposal. Next, we will investigate two of the most important use cases:
1. Feed publishing
2. News feed retrieval
Figure 3-3 and Figure 3-4 show the detailed design for the two use cases, which will be explained in detail in Chapter 11.
In this final step, the interviewer might ask you a few follow-up questions or give you the freedom to discuss other additional points. Here are a few directions to follow:
•The interviewer might want you to identify the system bottlenecks and discuss potential improvements. Never say your design is perfect and nothing can be improved. There is always something to improve upon. This is a great opportunity to show your critical thinking and leave a good final impression.
•It could be useful to give the interviewer a recap of your design. This is particularly important if you suggested a few solutions. Refreshing your interviewer’s memory can be helpful after a long session.
•Error cases (server failure, network loss, etc.) are interesting to talk about.
•Operation issues are worth mentioning. How do you monitor metrics and error logs? How to roll out the system?
•How to handle the next scale curve is also an interesting topic. For example, if your current design supports 1 million users, what changes do you need to make to support 10 million users?
•Propose other refinements you need if you had more time.
To wrap up, we summarize a list of the Dos and Don’ts.
Dos
•Always ask for clarification. Do not assume your assumption is correct.
•Understand the requirements of the problem.
•There is neither the right answer nor the best answer. A solution designed to solve the problems of a young startup is different from that of an established company with millions of users. Make sure you understand the requirements.
•Let the interviewer know what you are thinking. Communicate with your interviewer.
•Suggest multiple approaches if possible.
•Once you agree with your interviewer on the blueprint, go into details on each component. Design the most critical components first.
•Bounce ideas off the interviewer. A good interviewer works with you as a teammate.
•Never give up.
Don’ts
•Don't be unprepared for typical interview questions.
•Don’t jump into a solution without clarifying the requirements and assumptions.
•Don’t go into too much detail on a single component in the beginning. Give the high-level design first, then drill down.
•If you get stuck, don't hesitate to ask for hints.
•Again, communicate. Don't think in silence.
•Don’t think your interview is done once you give the design. You are not done until your interviewer says you are done. Ask for feedback early and often.
System design interview questions are usually very broad, and 45 minutes or an hour is not enough to cover the entire design. Time management is essential. How much time should you spend on each step? The following is a very rough guide on distributing your time in a 45-minute interview session. Please remember this is a rough estimate, and the actual time distribution depends on the scope of the problem and the requirements from the interviewer.
Step 1 Understand the problem and establish design scope: 3 - 10 minutes
Step 2 Propose high-level design and get buy-in: 10 - 15 minutes
Step 3 Design deep dive: 10 - 25 minutes
Step 4 Wrap up: 3 - 5 minutes
In a network system, a rate limiter is used to control the rate of traffic sent by a client or a service. In the HTTP world, a rate limiter limits the number of client requests allowed to be sent over a specified period. If the API request count exceeds the threshold defined by the rate limiter, all the excess calls are blocked. Here are a few examples:
•A user can write no more than 2 posts per second.
•You can create a maximum of 10 accounts per day from the same IP address.
•You can claim rewards no more than 5 times per week from the same device.
In this chapter, you are asked to design a rate limiter. Before starting the design, we first look at the benefits of using an API rate limiter:
•Prevent resource starvation caused by Denial of Service (DoS) attack [1]. Almost all APIs published by large tech companies enforce some form of rate limiting. For example, Twitter limits the number of tweets to 300 per 3 hours [2]. Google docs APIs have the following default limit: 300 per user per 60 seconds for read requests [3]. A rate limiter prevents DoS attacks, either intentional or unintentional, by blocking the excess calls.
•Reduce cost. Limiting excess requests means fewer servers and allocating more resources to high priority APIs. Rate limiting is extremely important for companies that use paid third party APIs. For example, you are charged on a per-call basis for the following external APIs: check credit, make a payment, retrieve health records, etc. Limiting the number of calls is essential to reduce costs.
•Prevent servers from being overloaded. To reduce server load, a rate limiter is used to filter out excess requests caused by bots or users’ misbehavior.
Rate limiting can be implemented using different algorithms, each with its pros and cons. The interactions between an interviewer and a candidate help to clarify the type of rate limiters we are trying to build.
Candidate: What kind of rate limiter are we going to design? Is it a client-side rate limiter or server-side API rate limiter?
Interviewer: Great question. We focus on the server-side API rate limiter.
Candidate: Does the rate limiter throttle API requests based on IP, the user ID, or other properties?
Interviewer: The rate limiter should be flexible enough to support different sets of throttle rules.
Candidate: What is the scale of the system? Is it built for a startup or a big company with a large user base?
Interviewer: The system must be able to handle a large number of requests.
Candidate: Will the system work in a distributed environment?
Interviewer: Yes.
Candidate: Is the rate limiter a separate service or should it be implemented in application code?
Interviewer: It is a design decision up to you.
Candidate: Do we need to inform users who are throttled?
Interviewer: Yes.
Requirements
Here is a summary of the requirements for the system:
•Accurately limit excessive requests.
•Low latency. The rate limiter should not slow down HTTP response time.
•Use as little memory as possible.
•Distributed rate limiting. The rate limiter can be shared across multiple servers or processes.
•Exception handling. Show clear exceptions to users when their requests are throttled.
•High fault tolerance. If there are any problems with the rate limiter (for example, a cache server goes offline), it does not affect the entire system.
Let us keep things simple and use a basic client and server model for communication.
Intuitively, you can implement a rate limiter at either the client or server-side.
•Client-side implementation. Generally speaking, the client is an unreliable place to enforce rate limiting because client requests can easily be forged by malicious actors. Moreover, we might not have control over the client implementation.
•Server-side implementation. Figure 4-1 shows a rate limiter that is placed on the server-side.
Besides the client and server-side implementations, there is an alternative way. Instead of putting a rate limiter at the API servers, we create a rate limiter middleware, which throttles requests to your APIs as shown in Figure 4-2.
Let us use an example in Figure 4-3 to illustrate how rate limiting works in this design. Assume our API allows 2 requests per second, and a client sends 3 requests to the server within a second. The first two requests are routed to API servers. However, the rate limiter middleware throttles the third request and returns an HTTP status code 429. The HTTP 429 response status code indicates a user has sent too many requests.
Cloud microservices [4] have become widely popular and rate limiting is usually implemented within a component called API gateway. API gateway is a fully managed service that supports rate limiting, SSL termination, authentication, IP whitelisting, servicing static content, etc. For now, we only need to know that the API gateway is a middleware that supports rate limiting.
While designing a rate limiter, an important question to ask ourselves is: where should the rate limiter be implemented, on the server-side or in a gateway? There is no absolute answer. It depends on your company’s current technology stack, engineering resources, priorities, goals, etc. Here are a few general guidelines:
•Evaluate your current technology stack, such as programming language, cache service, etc. Make sure your current programming language is efficient to implement rate limiting on the server-side.
•Identify the rate limiting algorithm that fits your business needs. When you implement everything on the server-side, you have full control of the algorithm. However, your choice might be limited if you use a third-party gateway.
•If you have already used microservice architecture and included an API gateway in the design to perform authentication, IP whitelisting, etc., you may add a rate limiter to the API gateway.
•Building your own rate limiting service takes time. If you do not have enough engineering resources to implement a rate limiter, a commercial API gateway is a better option.
Rate limiting can be implemented using different algorithms, and each of them has distinct pros and cons. Even though this chapter does not focus on algorithms, understanding them at a high level helps to choose the right algorithm or combination of algorithms to fit our use cases. Here is a list of popular algorithms:
•Token bucket
•Leaking bucket
•Fixed window counter
•Sliding window log
•Sliding window counter
The token bucket algorithm is widely used for rate limiting. It is simple, well understood and commonly used by internet companies. Both Amazon [5] and Stripe [6] use this algorithm to throttle their API requests.
The token bucket algorithm works as follows:
•A token bucket is a container that has pre-defined capacity. Tokens are put in the bucket at preset rates periodically. Once the bucket is full, no more tokens are added. As shown in Figure 4-4, the token bucket capacity is 4. The refiller puts 2 tokens into the bucket every second. Once the bucket is full, extra tokens will overflow.
•Each request consumes one token. When a request arrives, we check if there are enough tokens in the bucket. Figure 4-5 explains how it works.
•If there are enough tokens, we take one token out for each request, and the request goes through.
•If there are not enough tokens, the request is dropped.
Figure 4-6 illustrates how token consumption, refill, and rate limiting logic work. In this example, the token bucket size is 4, and the refill rate is 4 per 1 minute.
The token bucket algorithm takes two parameters:
•Bucket size: the maximum number of tokens allowed in the bucket
•Refill rate: number of tokens put into the bucket every second
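These two parameters are enough to sketch the algorithm in code. The snippet below is a minimal illustration, not a production implementation; the names (`TokenBucket`, `allow_request`) are our own, and it refills lazily on each request based on elapsed time instead of running a separate refiller process:

```python
import time

class TokenBucket:
    """Minimal token bucket sketch. Tokens are refilled lazily based on
    elapsed time rather than by a background refiller."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity          # bucket size: max tokens allowed
        self.refill_rate = refill_rate    # tokens added per second
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = time.monotonic()

    def allow_request(self) -> bool:
        now = time.monotonic()
        # Add tokens accumulated since the last check; overflow is capped.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1              # each request consumes one token
            return True
        return False                      # not enough tokens: request dropped
```

With `capacity=4` and `refill_rate=2`, a burst of four requests passes immediately and the fifth is dropped until tokens accumulate again, which is the burst-friendly behavior described above.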
How many buckets do we need? This varies, and it depends on the rate-limiting rules. Here are a few examples.
•It is usually necessary to have different buckets for different API endpoints. For instance, if a user is allowed to make 1 post per second, add 150 friends per day, and like 5 posts per second, 3 buckets are required for each user.
•If we need to throttle requests based on IP addresses, each IP address requires a bucket.
•If the system allows a maximum of 10,000 requests per second, it makes sense to have a global bucket shared by all requests.
Pros:
•The algorithm is easy to implement.
•Memory efficient.
•Token bucket allows a burst of traffic for short periods. A request can go through as long as there are tokens left.
Cons:
•Two parameters in the algorithm are bucket size and token refill rate. However, it might be challenging to tune them properly.
The leaking bucket algorithm is similar to the token bucket except that requests are processed at a fixed rate. It is usually implemented with a first-in-first-out (FIFO) queue. The algorithm works as follows:
•When a request arrives, the system checks if the queue is full. If it is not full, the request is added to the queue.
•Otherwise, the request is dropped.
•Requests are pulled from the queue and processed at regular intervals.
Figure 4-7 explains how the algorithm works.
Leaking bucket algorithm takes the following two parameters:
•Bucket size: it is equal to the queue size. The queue holds the requests to be processed at a fixed rate.
•Outflow rate: it defines how many requests can be processed at a fixed rate, usually in seconds.
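As a rough sketch of the bounded FIFO queue (names are ours, and in a real deployment a scheduler would invoke `leak()` at the fixed outflow rate):

```python
import collections

class LeakingBucket:
    """Minimal leaking bucket sketch: a bounded FIFO queue plus a leak()
    step that a scheduler would call at the fixed outflow rate."""

    def __init__(self, bucket_size: int, outflow_rate: int):
        self.queue = collections.deque()  # requests waiting to be processed
        self.bucket_size = bucket_size    # queue capacity
        self.outflow_rate = outflow_rate  # requests processed per interval

    def allow_request(self, request) -> bool:
        if len(self.queue) < self.bucket_size:
            self.queue.append(request)    # queued for later processing
            return True
        return False                      # bucket full: request dropped

    def leak(self) -> list:
        """Called at regular intervals; drains up to outflow_rate requests."""
        processed = []
        for _ in range(min(self.outflow_rate, len(self.queue))):
            processed.append(self.queue.popleft())
        return processed
```

Note that `allow_request` only enqueues; the actual work happens at the steady pace of `leak()`, which is what gives the algorithm its stable outflow rate.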
Shopify, an ecommerce company, uses leaky buckets for rate-limiting [7].
Pros:
•Memory efficient given the limited queue size.
•Requests are processed at a fixed rate; therefore, it is suitable for use cases where a stable outflow rate is needed.
Cons:
•A burst of traffic fills up the queue with old requests, and if they are not processed in time, recent requests will be rate limited.
•There are two parameters in the algorithm. It might not be easy to tune them properly.
Fixed window counter algorithm works as follows:
•The algorithm divides the timeline into fixed-sized time windows and assigns a counter for each window.
•Each request increments the counter by one.
•Once the counter reaches the pre-defined threshold, new requests are dropped until a new time window starts.
Let us use a concrete example to see how it works. In Figure 4-8, the time unit is 1 second and the system allows a maximum of 3 requests per second. In each second window, if more than 3 requests are received, extra requests are dropped as shown in Figure 4-8.
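The counter logic from this example can be sketched as follows. This is an illustrative single-process version (the names are ours); window boundaries are aligned to multiples of the window size:

```python
class FixedWindowCounter:
    """Minimal fixed window counter sketch; windows are aligned to
    multiples of window_size seconds."""

    def __init__(self, window_size: int, threshold: int):
        self.window_size = window_size  # window length in seconds
        self.threshold = threshold      # max requests per window
        self.counters = {}              # window start time -> request count

    def allow_request(self, timestamp: float) -> bool:
        # Find the start of the fixed window this timestamp falls into.
        window = int(timestamp // self.window_size) * self.window_size
        count = self.counters.get(window, 0)
        if count >= self.threshold:
            return False                # window quota exhausted
        self.counters[window] = count + 1
        return True
```

With a 1-second window and a threshold of 3, the fourth request inside the same second is dropped, and the counter starts fresh when a new window begins.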
A major problem with this algorithm is that a burst of traffic at the edges of time windows could cause more requests than the allowed quota to go through. Consider the following case:
In Figure 4-9, the system allows a maximum of 5 requests per minute, and the available quota resets at the human-friendly round minute. As seen, there are five requests between 2:00:00 and 2:01:00 and five more requests between 2:01:00 and 2:02:00. For the one-minute window between 2:00:30 and 2:01:30, 10 requests go through. That is twice the allowed number of requests.
Pros:
•Memory efficient.
•Easy to understand.
•Resetting available quota at the end of a unit time window fits certain use cases.
Cons:
•Spike in traffic at the edges of a window could cause more requests than the allowed quota to go through.
As discussed previously, the fixed window counter algorithm has a major issue: it allows more requests to go through at the edges of a window. The sliding window log algorithm fixes the issue. It works as follows:
•The algorithm keeps track of request timestamps. Timestamp data is usually kept in cache, such as sorted sets of Redis [8].
•When a new request comes in, remove all the outdated timestamps. Outdated timestamps are defined as those older than the start of the current time window.
•Add timestamp of the new request to the log.
•If the log size is the same or lower than the allowed count, a request is accepted. Otherwise, it is rejected.
We explain the algorithm with an example as shown in Figure 4-10.
In this example, the rate limiter allows 2 requests per minute. Usually, Unix timestamps are stored in the log. However, a human-readable representation of time is used in our example for better readability.
•The log is empty when a new request arrives at 1:00:01. Thus, the request is allowed.
•A new request arrives at 1:00:30, the timestamp 1:00:30 is inserted into the log. After the insertion, the log size is 2, not larger than the allowed count. Thus, the request is allowed.
•A new request arrives at 1:00:50, and the timestamp is inserted into the log. After the insertion, the log size is 3, larger than the allowed size 2. Therefore, this request is rejected even though the timestamp remains in the log.
•A new request arrives at 1:01:40. Requests in the range [1:00:40,1:01:40) are within the latest time frame, but requests sent before 1:00:40 are outdated. Two outdated timestamps, 1:00:01 and 1:00:30, are removed from the log. After the remove operation, the log size becomes 2; therefore, the request is accepted.
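The walkthrough above translates into a short sketch (class and method names are ours). Note the subtlety from the third step: a rejected request's timestamp still stays in the log:

```python
from collections import deque

class SlidingWindowLog:
    """Minimal sliding window log sketch. Timestamps are in seconds;
    a rejected request's timestamp is still recorded, matching the
    book's walkthrough."""

    def __init__(self, window_size: float, allowed: int):
        self.window_size = window_size  # rolling window length in seconds
        self.allowed = allowed          # max requests per rolling window
        self.log = deque()              # request timestamps, oldest first

    def allow_request(self, timestamp: float) -> bool:
        # Remove timestamps older than the start of the current window.
        while self.log and self.log[0] < timestamp - self.window_size:
            self.log.popleft()
        self.log.append(timestamp)      # logged even if the request is rejected
        return len(self.log) <= self.allowed
```

Replaying the example (a 60-second window, 2 requests allowed) with timestamps at 1:00:01, 1:00:30, 1:00:50, and 1:01:40 reproduces the allow/allow/reject/allow sequence described above.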
Pros:
•Rate limiting implemented by this algorithm is very accurate. In any rolling window, requests will not exceed the rate limit.
Cons:
•The algorithm consumes a lot of memory because even if a request is rejected, its timestamp might still be stored in memory.
The sliding window counter algorithm is a hybrid approach that combines the fixed window counter and sliding window log. The algorithm can be implemented by two different approaches. We will explain one implementation in this section and provide reference for the other implementation at the end of the section. Figure 4-11 illustrates how this algorithm works.
Assume the rate limiter allows a maximum of 7 requests per minute, and there are 5 requests in the previous minute and 3 in the current minute. For a new request that arrives at a 30% position in the current minute, the number of requests in the rolling window is calculated using the following formula:
•Requests in current window + requests in the previous window * overlap percentage of the rolling window and previous window
•Using this formula, we get 3 + 5 * 70% = 6.5 requests. Depending on the use case, the number can either be rounded up or down. In our example, it is rounded down to 6.
Since the rate limiter allows a maximum of 7 requests per minute, the current request can go through. However, the limit will be reached after receiving one more request.
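The calculation can be written out explicitly. This is just the formula above as a small helper (the function name is ours):

```python
def rolling_window_count(curr_count: int, prev_count: int,
                         elapsed_fraction: float) -> float:
    """Estimate the number of requests in the rolling window.
    elapsed_fraction is how far into the current fixed window we are
    (0.0 - 1.0), so the previous window overlaps the rolling window
    by (1 - elapsed_fraction)."""
    return curr_count + prev_count * (1 - elapsed_fraction)

# The book's example: 3 requests so far this minute, 5 in the previous
# minute, and the new request arrives 30% into the current minute.
estimate = rolling_window_count(curr_count=3, prev_count=5, elapsed_fraction=0.3)
# 3 + 5 * 0.7 = 6.5, rounded down to 6 in the example
allowed = int(estimate) < 7  # limit of 7 requests per minute
```

`int()` truncation implements the round-down choice; rounding up instead would make the limiter slightly stricter near window boundaries.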
Due to space limitations, we will not discuss the other implementation here. Interested readers should refer to the reference material [9]. This algorithm is not perfect. It has pros and cons.
Pros
•It smooths out spikes in traffic because the rate is based on the average rate of the previous window.
•Memory efficient.
Cons
•It only works for not-so-strict look-back windows. It is an approximation of the actual rate because it assumes requests in the previous window are evenly distributed. However, this problem may not be as bad as it seems. According to experiments done by Cloudflare [10], only 0.003% of requests are wrongly allowed or rate limited among 400 million requests.
The basic idea of rate limiting algorithms is simple. At the high-level, we need a counter to keep track of how many requests are sent from the same user, IP address, etc. If the counter is larger than the limit, the request is disallowed.
Where shall we store counters? Using the database is not a good idea due to slowness of disk access. In-memory cache is chosen because it is fast and supports time-based expiration strategy. For instance, Redis [11] is a popular option to implement rate limiting. It is an in-memory store that offers two commands: INCR and EXPIRE.
•INCR: It increases the stored counter by 1.
•EXPIRE: It sets a timeout for the counter. If the timeout expires, the counter is automatically deleted.
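To illustrate how INCR and EXPIRE combine into a per-window counter, here is a sketch that simulates the two commands with an in-memory dict; a real deployment would call a Redis client instead, and `FakeRedis` as well as the key format `rate:<client_id>` are our own choices, not from the reference:

```python
import time

class FakeRedis:
    """In-memory stand-in for the two Redis commands used here."""
    def __init__(self):
        self.store = {}   # key -> [count, expiry_time]

    def incr(self, key: str) -> int:
        entry = self.store.get(key)
        if entry is None or entry[1] <= time.monotonic():
            entry = [0, float("inf")]      # fresh counter, no expiry yet
            self.store[key] = entry
        entry[0] += 1
        return entry[0]

    def expire(self, key: str, seconds: float) -> None:
        self.store[key][1] = time.monotonic() + seconds

def allow_request(redis, client_id: str, limit: int, window: float) -> bool:
    count = redis.incr(f"rate:{client_id}")
    if count == 1:
        redis.expire(f"rate:{client_id}", window)  # start the window timer
    return count <= limit
```

Note that this read-then-write flow is not atomic across callers, which matters once multiple rate limiter servers share one counter.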
Figure 4-12 shows the high-level architecture for rate limiting, and this works as follows:
•The client sends a request to rate limiting middleware.
•Rate limiting middleware fetches the counter from the corresponding bucket in Redis and checks if the limit is reached or not.
•If the limit is reached, the request is rejected.
•If the limit is not reached, the request is sent to API servers. Meanwhile, the system increments the counter and saves it back to Redis.
The high-level design in Figure 4-12 does not answer the following questions:
•How are rate limiting rules created? Where are the rules stored?
•How to handle requests that are rate limited?
In this section, we will first answer the questions regarding rate limiting rules and then go over the strategies to handle rate-limited requests. Finally, we will discuss rate limiting in distributed environment, a detailed design, performance optimization and monitoring.
Lyft open-sourced their rate-limiting component [12]. We will peek inside the component and look at some examples of rate limiting rules:

domain: messaging
descriptors:
  - key: message_type
    value: marketing
    rate_limit:
      unit: day
      requests_per_unit: 5

In the above example, the system is configured to allow a maximum of 5 marketing messages per day. Here is another example:
domain: auth
descriptors:
  - key: auth_type
    value: login
    rate_limit:
      unit: minute
      requests_per_unit: 5

This rule shows that clients are not allowed to log in more than 5 times in 1 minute. Rules are generally written in configuration files and saved on disk.
In case a request is rate limited, APIs return an HTTP response code 429 (too many requests) to the client. Depending on the use case, we may enqueue the rate-limited requests to be processed later. For example, if some orders are rate limited due to system overload, we may keep those orders to be processed later.
How does a client know whether it is being throttled? And how does a client know the number of allowed remaining requests before being throttled? The answer lies in HTTP response headers. The rate limiter returns the following HTTP headers to clients:
X-Ratelimit-Remaining: The remaining number of allowed requests within the window.
X-Ratelimit-Limit: It indicates how many calls the client can make per time window.
X-Ratelimit-Retry-After: The number of seconds to wait until you can make a request again without being throttled.
When a user has sent too many requests, a 429 too many requests error and X-Ratelimit-Retry-After header are returned to the client.
Figure 4-13 presents a detailed design of the system.
•Rules are stored on the disk. Workers frequently pull rules from the disk and store them in the cache.
•When a client sends a request to the server, the request is sent to the rate limiter middleware first.
•Rate limiter middleware loads rules from the cache. It fetches counters and last request timestamp from Redis cache. Based on the response, the rate limiter decides:
•if the request is not rate limited, it is forwarded to API servers.
•if the request is rate limited, the rate limiter returns 429 too many requests error to the client. In the meantime, the request is either dropped or forwarded to the queue.
Building a rate limiter that works in a single server environment is not difficult. However, scaling the system to support multiple servers and concurrent threads is a different story. There are two challenges:
•Race condition
•Synchronization issue
As discussed earlier, rate limiter works as follows at the high-level:
•Read the counter value from Redis.
•Check if (counter + 1) exceeds the threshold.
•If not, increment the counter value by 1 in Redis.
Race conditions can happen in a highly concurrent environment as shown in Figure 4-14.
Assume the counter value in Redis is 3. If two requests concurrently read the counter value before either of them writes the value back, each will increment the counter by one and write it back without checking the other thread. Both requests (threads) believe they have the correct counter value 4. However, the correct counter value should be 5.
Locks are the most obvious solution for solving race conditions. However, locks will significantly slow down the system. Two strategies are commonly used to solve the problem: Lua scripts [13] and the sorted sets data structure in Redis [8]. Readers interested in these strategies can refer to the corresponding reference materials [8] [13].
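The core idea behind the Lua-script strategy is to make read-check-increment a single atomic step that no other request can interleave with. Redis achieves this by running the script server-side; the sketch below emulates the same atomicity in-process with a lock, purely for illustration (the names and threshold are assumptions):

```python
# Emulating an atomic check-and-increment. Redis would run this logic
# as one Lua script; here a lock makes the three steps indivisible.

import threading

THRESHOLD = 5
store = {}
lock = threading.Lock()

def allow_request_atomic(client_id: str) -> bool:
    with lock:  # no other thread can interleave between read and write
        counter = store.get(client_id, 0)
        if counter + 1 > THRESHOLD:
            return False
        store[client_id] = counter + 1
        return True
```

With the atomic version, two concurrent requests can never both observe the same stale counter value.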
Synchronization is another important factor to consider in a distributed environment. To support millions of users, one rate limiter server might not be enough to handle the traffic. When multiple rate limiter servers are used, synchronization is required. For example, on the left side of Figure 4-15, client 1 sends requests to rate limiter 1, and client 2 sends requests to rate limiter 2. As the web tier is stateless, clients can send requests to a different rate limiter as shown on the right side of Figure 4-15. If no synchronization happens, rate limiter 1 does not contain any data about client 2. Thus, the rate limiter cannot work properly.
One possible solution is to use sticky sessions that allow a client to send traffic to the same rate limiter. This solution is not advisable because it is neither scalable nor flexible. A better approach is to use centralized data stores like Redis. The design is shown in Figure 4-16.
Performance optimization is a common topic in system design interviews. We will cover two areas to improve.
First, a multi-data center setup is crucial for a rate limiter because latency is high for users located far away from the data center. Most cloud service providers build many edge server locations around the world. For example, as of May 20, 2020, Cloudflare has 194 geographically distributed edge servers [14]. Traffic is automatically routed to the closest edge server to reduce latency.
Second, synchronize data with an eventual consistency model. If you are unclear about the eventual consistency model, refer to the “Consistency” section in “Chapter 6: Design a Key-value Store.”
After the rate limiter is put in place, it is important to gather analytics data to check whether the rate limiter is effective. Primarily, we want to make sure:
•The rate limiting algorithm is effective.
•The rate limiting rules are effective.
For example, if rate limiting rules are too strict, many valid requests are dropped. In this case, we want to relax the rules a little bit. In another example, we notice our rate limiter becomes ineffective when there is a sudden increase in traffic like flash sales. In this scenario, we may replace the algorithm to support burst traffic. Token bucket is a good fit here.
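To see why the token bucket handles bursts well, a minimal sketch helps: a request is allowed whenever a token is available, the bucket refills at a steady rate, and the bucket capacity caps the burst size. The capacity and refill rate below are illustrative parameters; a real limiter would keep this state in Redis rather than in-process.

```python
# Minimal token bucket: capacity bounds the burst, refill_rate bounds
# the sustained request rate. Parameters are illustrative.

class TokenBucket:
    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity        # max tokens (allowed burst size)
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)   # start with a full bucket
        self.last_refill = 0.0          # timestamp of the last refill

    def allow(self, now: float) -> bool:
        # refill tokens accumulated since the last request
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1            # consume one token per request
            return True
        return False                    # bucket empty: request dropped
```

A burst of up to `capacity` requests is absorbed immediately, after which requests are admitted at the refill rate.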
In this chapter, we discussed different algorithms of rate limiting and their pros/cons. Algorithms discussed include:
•Token bucket
•Leaking bucket
•Fixed window
•Sliding window log
•Sliding window counter
Then, we discussed the system architecture, rate limiter in a distributed environment, performance optimization, and monitoring. Similar to any system design interview question, there are additional talking points you can mention if time allows:
•Hard vs soft rate limiting.
•Hard: The number of requests cannot exceed the threshold.
•Soft: Requests can exceed the threshold for a short period.
•Rate limiting at different levels. In this chapter, we only talked about rate limiting at the application level (HTTP: layer 7). It is possible to apply rate limiting at other layers. For example, you can apply rate limiting by IP address using Iptables [15] (IP: layer 3). Note: The Open Systems Interconnection model (OSI model) has 7 layers [16]: Layer 1: Physical layer, Layer 2: Data link layer, Layer 3: Network layer, Layer 4: Transport layer, Layer 5: Session layer, Layer 6: Presentation layer, Layer 7: Application layer.
•Avoid being rate limited. Design your client with best practices:
•Use a client cache to avoid making frequent API calls.
•Understand the limit and do not send too many requests in a short time frame.
•Include code to catch exceptions or errors so your client can gracefully recover from exceptions.
•Add sufficient back-off time to retry logic.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Rate-limiting strategies and techniques: https://cloud.google.com/solutions/rate-limiting-strategies-techniques
[2] Twitter rate limits: https://developer.twitter.com/en/docs/basics/rate-limits
[3] Google docs usage limits: https://developers.google.com/docs/api/limits
[4] IBM microservices: https://www.ibm.com/cloud/learn/microservices
[5] Throttle API requests for better throughput: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-request-throttling.html
[6] Stripe rate limiters: https://stripe.com/blog/rate-limiters
[7] Shopify REST Admin API rate limits: https://help.shopify.com/en/api/reference/rest-admin-api-rate-limits
[8] Better Rate Limiting With Redis Sorted Sets: https://engineering.classdojo.com/blog/2015/02/06/rolling-rate-limiter/
[9] System Design — Rate limiter and Data modelling: https://medium.com/@saisandeepmopuri/system-design-rate-limiter-and-data-modelling-9304b0d18250
[10] How we built rate limiting capable of scaling to millions of domains: https://blog.cloudflare.com/counting-things-a-lot-of-different-things/
[11] Redis website: https://redis.io/
[12] Lyft rate limiting: https://github.com/lyft/ratelimit
[13] Scaling your API with rate limiters: https://gist.github.com/ptarjan/e38f45f2dfe601419ca3af937fff574d#request-rate-limiter
[14] What is edge computing: https://www.cloudflare.com/learning/serverless/glossary/what-is-edge-computing/
[15] Rate Limit Requests with Iptables: https://blog.programster.org/rate-limit-requests-with-iptables
[16] OSI model: https://en.wikipedia.org/wiki/OSI_model#Layer_architecture
To achieve horizontal scaling, it is important to distribute requests/data efficiently and evenly across servers. Consistent hashing is a commonly used technique to achieve this goal. But first, let us take an in-depth look at the problem.
If you have n cache servers, a common way to balance the load is to use the following hash method:
serverIndex = hash(key) % N, where N is the size of the server pool.
Let us use an example to illustrate how it works. As shown in Table 5-1, we have 4 servers and 8 string keys with their hashes.
To fetch the server where a key is stored, we perform the modular operation f(key) % 4. For instance, hash(key0) % 4 = 1 means a client must contact server 1 to fetch the cached data. Figure 5-1 shows the distribution of keys based on Table 5-1.
This approach works well when the size of the server pool is fixed, and the data distribution is even. However, problems arise when new servers are added, or existing servers are removed. For example, if server 1 goes offline, the size of the server pool becomes 3. Using the same hash function, we get the same hash value for a key. But applying modular operation gives us different server indexes because the number of servers is reduced by 1. We get the results as shown in Table 5-2 by applying hash % 3:
Figure 5-2 shows the new distribution of keys based on Table 5-2.
As shown in Figure 5-2, most keys are redistributed, not just the ones originally stored in the offline server (server 1). This means that when server 1 goes offline, most cache clients will connect to the wrong servers to fetch data. This causes a storm of cache misses. Consistent hashing is an effective technique to mitigate this problem.
Quoted from Wikipedia: "Consistent hashing is a special kind of hashing such that when a hash table is re-sized and consistent hashing is used, only k/n keys need to be remapped on average, where k is the number of keys, and n is the number of slots. In contrast, in most traditional hash tables, a change in the number of array slots causes nearly all keys to be remapped [1]”.
Now that we understand the definition of consistent hashing, let us find out how it works. Assume SHA-1 is used as the hash function f, and the output range of the hash function is: x0, x1, x2, x3, …, xn. In cryptography, SHA-1’s hash space goes from 0 to 2^160 - 1. That means x0 corresponds to 0, xn corresponds to 2^160 - 1, and all the other hash values in the middle fall between 0 and 2^160 - 1. Figure 5-3 shows the hash space.
By connecting both ends, we get a hash ring as shown in Figure 5-4:
Using the same hash function f, we map servers based on server IP or name onto the ring. Figure 5-5 shows that 4 servers are mapped on the hash ring.
One thing worth mentioning is that the hash function used here is different from the one in “the rehashing problem,” and there is no modular operation. As shown in Figure 5-6, 4 cache keys (key0, key1, key2, and key3) are hashed onto the hash ring.
To determine which server a key is stored on, we go clockwise from the key position on the ring until a server is found. Figure 5-7 explains this process. Going clockwise, key0 is stored on server 0; key1 is stored on server 1; key2 is stored on server 2, and key3 is stored on server 3.
Using the logic described above, adding a new server will only require redistribution of a fraction of keys.
In Figure 5-8, after a new server 4 is added, only key0 needs to be redistributed. k1, k2, and k3 remain on the same servers. Let us take a close look at the logic. Before server 4 is added, key0 is stored on server 0. Now, key0 will be stored on server 4 because server 4 is the first server it encounters by going clockwise from key0’s position on the ring. The other keys are not redistributed based on the consistent hashing algorithm.
When a server is removed, only a small fraction of keys require redistribution with consistent hashing. In Figure 5-9, when server 1 is removed, only key1 must be remapped to server 2. The rest of the keys are unaffected.
The consistent hashing algorithm was introduced by Karger et al. at MIT [1]. The basic steps are:
•Map servers and keys onto the ring using a uniformly distributed hash function.
•To find out which server a key is mapped to, go clockwise from the key position until the first server on the ring is found.
Two problems are identified with this approach. First, it is impossible to keep the same size of partitions on the ring for all servers considering a server can be added or removed. A partition is the hash space between adjacent servers. It is possible that the size of the partitions on the ring assigned to each server is very small or fairly large. In Figure 5-10, if s1 is removed, s2’s partition (highlighted with the bidirectional arrows) is twice as large as s0 and s3’s partition.
Second, it is possible to have a non-uniform key distribution on the ring. For instance, if servers are mapped to positions listed in Figure 5-11, most of the keys are stored on server 2. However, server 1 and server 3 have no data.
A technique called virtual nodes or replicas is used to solve these problems.
A virtual node refers to the real node, and each server is represented by multiple virtual nodes on the ring. In Figure 5-12, both server 0 and server 1 have 3 virtual nodes. The 3 is arbitrarily chosen; in real-world systems, the number of virtual nodes is much larger. Instead of using s0, we have s0_0, s0_1, and s0_2 to represent server 0 on the ring. Similarly, s1_0, s1_1, and s1_2 represent server 1 on the ring. With virtual nodes, each server is responsible for multiple partitions. Partitions (edges) with label s0 are managed by server 0. On the other hand, partitions with label s1 are managed by server 1.
To find which server a key is stored on, we go clockwise from the key’s location and find the first virtual node encountered on the ring. In Figure 5-13, to find out which server k0 is stored on, we go clockwise from k0’s location and find virtual node s1_1, which refers to server 1.
As the number of virtual nodes increases, the distribution of keys becomes more balanced. This is because the standard deviation gets smaller with more virtual nodes, leading to balanced data distribution. Standard deviation measures how data are spread out. The outcome of an experiment carried out by online research [2] shows that with one or two hundred virtual nodes, the standard deviation is between 5% (200 virtual nodes) and 10% (100 virtual nodes) of the mean. The standard deviation will be smaller when we increase the number of virtual nodes. However, more space is needed to store data about virtual nodes. This is a tradeoff, and we can tune the number of virtual nodes to fit our system requirements.
When a server is added or removed, a fraction of data needs to be redistributed. How can we find the affected range to redistribute the keys?
In Figure 5-14, server 4 is added onto the ring. The affected range starts from s4 (newly added node) and moves anticlockwise around the ring until a server is found (s3). Thus, keys located between s3 and s4 need to be redistributed to s4.
When a server (s1) is removed as shown in Figure 5-15, the affected range starts from s1 (removed node) and moves anticlockwise around the ring until a server is found (s0). Thus, keys located between s0 and s1 must be redistributed to s2.
In this chapter, we had an in-depth discussion about consistent hashing, including why it is needed and how it works. The benefits of consistent hashing include:
•A minimal set of keys is redistributed when servers are added or removed.
•It is easy to scale horizontally because data are more evenly distributed.
•It mitigates the hotspot key problem. Excessive access to a specific shard could cause server overload. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard. Consistent hashing helps to mitigate the problem by distributing the data more evenly.
Consistent hashing is widely used in real-world systems, including some notable ones:
•Partitioning component of Amazon’s Dynamo database [3]
•Data partitioning across the cluster in Apache Cassandra [4]
•Discord chat application [5]
•Akamai content delivery network [6]
•Maglev network load balancer [7]
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Consistent hashing: https://en.wikipedia.org/wiki/Consistent_hashing
[2] Consistent Hashing: https://tom-e-white.com/2007/11/consistent-hashing.html
[3] Dynamo: Amazon’s Highly Available Key-value Store: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[4] Cassandra - A Decentralized Structured Storage System: http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF
[5] How Discord Scaled Elixir to 5,000,000 Concurrent Users: https://blog.discord.com/scaling-elixir-f9b8e1e7c29b
[6] CS168: The Modern Algorithmic Toolbox Lecture #1: Introduction and Consistent Hashing: http://theory.stanford.edu/~tim/s16/l/l1.pdf
[7] Maglev: A Fast and Reliable Software Network Load Balancer: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44824.pdf
A key-value store, also referred to as a key-value database, is a non-relational database. Each unique identifier is stored as a key with its associated value. This data pairing is known as a “key-value” pair.
In a key-value pair, the key must be unique, and the value associated with the key can be accessed through the key. Keys can be plain text or hashed values. For performance reasons, a short key works better. What do keys look like? Here are a few examples:
•Plain text key: “last_logged_in_at”
•Hashed key: 253DDEC4
The value in a key-value pair can be strings, lists, objects, etc. The value is usually treated as an opaque object in key-value stores, such as Amazon Dynamo [1], Memcached [2], Redis [3], etc.
Here is a data snippet in a key-value store:
In this chapter, you are asked to design a key-value store that supports the following operations:
- put(key, value) // insert “value” associated with “key”
- get(key) // get “value” associated with “key”
There is no perfect design. Each design achieves a specific balance regarding the tradeoffs of read, write, and memory usage. Another tradeoff has to be made between consistency and availability. In this chapter, we design a key-value store that comprises the following characteristics:
•The size of a key-value pair is small: less than 10 KB.
•Ability to store big data.
•High availability: The system responds quickly, even during failures.
•High scalability: The system can be scaled to support large data sets.
•Automatic scaling: The addition/deletion of servers should be automatic based on traffic.
•Tunable consistency.
•Low latency.
Developing a key-value store that resides in a single server is easy. An intuitive approach is to store key-value pairs in a hash table, which keeps everything in memory. Even though memory access is fast, fitting everything in memory may be impossible due to the space constraint. Two optimizations can be done to fit more data in a single server:
•Data compression
•Store only frequently used data in memory and the rest on disk
Even with these optimizations, a single server can reach its capacity very quickly. A distributed key-value store is required to support big data.
A distributed key-value store is also called a distributed hash table, which distributes key-value pairs across many servers. When designing a distributed system, it is important to understand the CAP (Consistency, Availability, Partition Tolerance) theorem.
CAP theorem states that it is impossible for a distributed system to simultaneously provide more than two of these three guarantees: consistency, availability, and partition tolerance. Let us establish a few definitions.
Consistency: consistency means all clients see the same data at the same time no matter which node they connect to.
Availability: availability means any client which requests data gets a response even if some of the nodes are down.
Partition Tolerance: a partition indicates a communication break between two nodes. Partition tolerance means the system continues to operate despite network partitions.
CAP theorem states that one of the three properties must be sacrificed to support 2 of the 3 properties as shown in Figure 6-1.
Nowadays, key-value stores are classified based on the two CAP characteristics they support:
CP (consistency and partition tolerance) systems: a CP key-value store supports consistency and partition tolerance while sacrificing availability.
AP (availability and partition tolerance) systems: an AP key-value store supports availability and partition tolerance while sacrificing consistency.
CA (consistency and availability) systems: a CA key-value store supports consistency and availability while sacrificing partition tolerance. Since network failure is unavoidable, a distributed system must tolerate network partition. Thus, a CA system cannot exist in real-world applications.
What you read above is mostly the definition part. To make it easier to understand, let us take a look at some concrete examples. In distributed systems, data is usually replicated multiple times. Assume data are replicated on three replica nodes, n1, n2 and n3 as shown in Figure 6-2.
Ideal situation
In the ideal world, network partition never occurs. Data written to n1 is automatically replicated to n2 and n3. Both consistency and availability are achieved.
Real-world distributed systems
In a distributed system, partitions cannot be avoided, and when a partition occurs, we must choose between consistency and availability. In Figure 6-3, n3 goes down and cannot communicate with n1 and n2. If clients write data to n1 or n2, data cannot be propagated to n3. If data is written to n3 but not propagated to n1 and n2 yet, n1 and n2 would have stale data.
If we choose consistency over availability (CP system), we must block all write operations to n1 and n2 to avoid data inconsistency among these three servers, which makes the system unavailable. Bank systems usually have extremely high consistency requirements. For example, it is crucial for a bank system to display the most up-to-date balance info. If inconsistency occurs due to a network partition, the bank system returns an error before the inconsistency is resolved.
However, if we choose availability over consistency (AP system), the system keeps accepting reads, even though it might return stale data. For writes, n1 and n2 will keep accepting writes, and data will be synced to n3 when the network partition is resolved.
Choosing the right CAP guarantees that fit your use case is an important step in building a distributed key-value store. You can discuss this with your interviewer and design the system accordingly.
In this section, we will discuss the following core components and techniques used to build a key-value store:
•Data partition
•Data replication
•Consistency
•Inconsistency resolution
•Handling failures
•System architecture diagram
•Write path
•Read path
The content below is largely based on three popular key-value store systems: Dynamo [4], Cassandra [5], and BigTable [6].
For large applications, it is infeasible to fit the complete data set in a single server. The simplest way to accomplish this is to split the data into smaller partitions and store them in multiple servers. There are two challenges while partitioning the data:
•Distribute data across multiple servers evenly.
•Minimize data movement when nodes are added or removed.
Consistent hashing discussed in Chapter 5 is a great technique to solve these problems. Let us revisit how consistent hashing works at a high-level.
•First, servers are placed on a hash ring. In Figure 6-4, eight servers, represented by s0, s1, …, s7, are placed on the hash ring.
•Next, a key is hashed onto the same ring, and it is stored on the first server encountered while moving in the clockwise direction. For instance, key0 is stored in s1 using this logic.
Using consistent hashing to partition data has the following advantages:
Automatic scaling: servers could be added and removed automatically depending on the load.
Heterogeneity: the number of virtual nodes for a server is proportional to the server capacity. For example, servers with higher capacity are assigned with more virtual nodes.
To achieve high availability and reliability, data must be replicated asynchronously over N servers, where N is a configurable parameter. These N servers are chosen using the following logic: after a key is mapped to a position on the hash ring, walk clockwise from that position and choose the first N servers on the ring to store data copies. In Figure 6-5 (N = 3), key0 is replicated at s1, s2, and s3.
With virtual nodes, the first N nodes on the ring may be owned by fewer than N physical servers. To avoid this issue, we only choose unique servers while performing the clockwise walk logic.
Nodes in the same data center often fail at the same time due to power outages, network issues, natural disasters, etc. For better reliability, replicas are placed in distinct data centers, and data centers are connected through high-speed networks.
Since data is replicated at multiple nodes, it must be synchronized across replicas. Quorum consensus can guarantee consistency for both read and write operations. Let us establish a few definitions first.
N = The number of replicas
W = A write quorum of size W. For a write operation to be considered successful, the write operation must be acknowledged by W replicas.
R = A read quorum of size R. For a read operation to be considered successful, the read operation must wait for responses from at least R replicas.
Consider the following example shown in Figure 6-6 with N = 3.
W = 1 does not mean data is written on one server. For instance, with the configuration in Figure 6-6, data is replicated at s0, s1, and s2. W = 1 means that the coordinator must receive at least one acknowledgment before the write operation is considered successful. For instance, if we get an acknowledgment from s1, we no longer need to wait for acknowledgements from s0 and s2. A coordinator acts as a proxy between the client and the nodes.
The configuration of W, R, and N is a typical tradeoff between latency and consistency. If W = 1 or R = 1, an operation is returned quickly because a coordinator only needs to wait for a response from any of the replicas. If W or R > 1, the system offers better consistency; however, the query will be slower because the coordinator must wait for the response from the slowest replica.
If W + R > N, strong consistency is guaranteed because there must be at least one overlapping node that has the latest data to ensure consistency.
How do we configure N, W, and R to fit our use cases? Here are some of the possible setups:
If R = 1 and W = N, the system is optimized for fast reads.
If W = 1 and R = N, the system is optimized for fast writes.
If W + R > N, strong consistency is guaranteed (usually N = 3, W = R = 2).
If W + R <= N, strong consistency is not guaranteed.
Depending on the requirements, we can tune the values of W, R, and N to achieve the desired level of consistency.
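The overlap argument behind W + R > N can be made concrete with a toy simulation. The `Replica` and `Coordinator` classes below are illustrative assumptions (not part of any real store): the coordinator writes to the first W replicas and reads from the last R, so with W + R > N at least one replica appears in both sets and the read is guaranteed to see the latest write.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    store: dict = field(default_factory=dict)

    def write(self, key, value, version):
        self.store[key] = (value, version)

    def read(self, key):
        return self.store.get(key)

class Coordinator:
    """Toy quorum coordinator: requires W + R > N so read/write sets overlap."""
    def __init__(self, replicas, w, r):
        assert w + r > len(replicas), "W + R > N required for strong consistency"
        self.replicas, self.w, self.r = replicas, w, r
        self.version = 0

    def put(self, key, value):
        self.version += 1
        # Write to the first W replicas (a worst case: the other N - W
        # replicas have not received the update yet).
        for replica in self.replicas[: self.w]:
            replica.write(key, value, self.version)

    def get(self, key):
        # Read from the last R replicas and keep the newest version seen.
        responses = [rep.read(key) for rep in self.replicas[-self.r :]]
        responses = [x for x in responses if x is not None]
        return max(responses, key=lambda x: x[1])[0] if responses else None
```

With N = 3 and W = R = 2, the write set {s0, s1} and the read set {s1, s2} overlap at s1, so the read returns the fresh value even though s2 is stale.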
The consistency model is another important factor to consider when designing a key-value store. A consistency model defines the degree of data consistency, and a wide spectrum of possible consistency models exists:
•Strong consistency: any read operation returns a value corresponding to the result of the most recent successful write. A client never sees out-of-date data.
•Weak consistency: subsequent read operations may not see the most updated value.
•Eventual consistency: this is a specific form of weak consistency. Given enough time, all updates are propagated, and all replicas become consistent.
Strong consistency is usually achieved by forcing a replica not to accept new reads/writes until every replica has agreed on the current write. This approach is not ideal for highly available systems because it could block new operations. Dynamo and Cassandra adopt eventual consistency, which is the consistency model we recommend for our key-value store. With concurrent writes, eventual consistency allows inconsistent values to enter the system and forces the client to read the values and reconcile them. The next section explains how reconciliation works with versioning.
Replication gives high availability but causes inconsistencies among replicas. Versioning and vector clocks are used to solve inconsistency problems. Versioning means treating each data modification as a new immutable version of data. Before we talk about versioning, let us use an example to explain how inconsistency happens:
As shown in Figure 6-7, both replica nodes n1 and n2 have the same value. Let us call this value the original value. Server 1 and server 2 get the same value from the get("name") operation.
Next, server 1 changes the name to "johnSanFrancisco", and server 2 changes the name to "johnNewYork", as shown in Figure 6-8. These two changes are performed simultaneously. Now, we have conflicting values, called versions v1 and v2.
In this example, the original value can be ignored because the modifications were based on it. However, there is no clear way to resolve the conflict between the last two versions. To resolve this issue, we need a versioning system that can detect conflicts and reconcile them. A vector clock is a common technique to solve this problem. Let us examine how vector clocks work.
A vector clock is a [server, version] pair associated with a data item. It can be used to check if one version precedes, succeeds, or conflicts with another.
Assume a vector clock is represented by D([S1, v1], [S2, v2], …, [Sn, vn]), where D is a data item, v1 is a version counter, and S1 is a server number. If data item D is written to server Si, the system must perform one of the following tasks:
•Increment vi if [Si, vi] exists.
•Otherwise, create a new entry [Si, 1].
The above abstract logic is explained with a concrete example, as shown in Figure 6-9.
1. A client writes a data item D1 to the system, and the write is handled by server Sx, which now has the vector clock D1([Sx, 1]).
2. Another client reads the latest D1, updates it to D2, and writes it back. D2 descends from D1, so it overwrites D1. Assume the write is handled by the same server Sx, which now has the vector clock D2([Sx, 2]).
3. Another client reads the latest D2, updates it to D3, and writes it back. Assume the write is handled by server Sy, which now has the vector clock D3([Sx, 2], [Sy, 1]).
4. Another client reads the latest D2, updates it to D4, and writes it back. Assume the write is handled by server Sz, which now has D4([Sx, 2], [Sz, 1]).
5. When another client reads D3 and D4, it discovers a conflict, which is caused by data item D2 being modified by both Sy and Sz. The conflict is resolved by the client, and the updated data is sent to the server. Assume the write is handled by Sx, which now has D5([Sx, 3], [Sy, 1], [Sz, 1]). We will explain how to detect conflicts shortly.
Using vector clocks, it is easy to tell that a version X is an ancestor (i.e., no conflict) of version Y if the version counter for each participant in the vector clock of Y is greater than or equal to the one in version X. For example, the vector clock D([s0, 1], [s1, 1]) is an ancestor of D([s0, 1], [s1, 2]). Therefore, no conflict is recorded.
Similarly, you can tell that a version X is a sibling (i.e., a conflict exists) of Y if there is any participant in Y's vector clock who has a counter that is less than its corresponding counter in X. For example, the following two vector clocks indicate there is a conflict: D([s0, 1], [s1, 2]) and D([s0, 2], [s1, 1]).
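The update and comparison rules above can be sketched in a few lines of Python. Here a vector clock is modeled as a plain dict mapping server name to version counter; the function names are our own, not from Dynamo or Cassandra.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock `a` is equal to or a descendant of clock `b`."""
    return all(a.get(server, 0) >= count for server, count in b.items())

def compare(a: dict, b: dict) -> str:
    """Classify two vector clocks (dicts of server -> version counter)."""
    if descends(b, a):
        return "ancestor"    # a precedes b: no conflict
    if descends(a, b):
        return "descendant"  # a supersedes b: no conflict
    return "conflict"        # siblings: concurrent writes must be reconciled

def record_write(clock: dict, server: str) -> dict:
    """Return a new clock after `server` handles a write: increment the
    server's counter if present, otherwise create a new entry at 1."""
    updated = dict(clock)
    updated[server] = updated.get(server, 0) + 1
    return updated
```

For instance, `compare({"s0": 1, "s1": 1}, {"s0": 1, "s1": 2})` reports an ancestor relationship, while D([s0, 1], [s1, 2]) versus D([s0, 2], [s1, 1]) reports a conflict, matching the examples above.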
Even though vector clocks can resolve conflicts, there are two notable downsides. First, vector clocks add complexity to the client because it needs to implement conflict resolution logic.
Second, the [server: version] pairs in the vector clock could grow rapidly. To fix this problem, we set a threshold for the length, and if it exceeds the limit, the oldest pairs are removed. This can lead to inefficiencies in reconciliation because the descendant relationship cannot be determined accurately. However, based on the Dynamo paper [4], Amazon has not yet encountered this problem in production; therefore, it is probably an acceptable solution for most companies.
As with any large system at scale, failures are not only inevitable but common. Handling failure scenarios is very important. In this section, we first introduce techniques to detect failures. Then, we go over common failure resolution strategies.
In a distributed system, it is insufficient to believe that a server is down just because another server says so. Usually, it requires at least two independent sources of information to mark a server down.
As shown in Figure 6-10, all-to-all multicasting is a straightforward solution. However, it is inefficient when there are many servers in the system.
A better solution is to use a decentralized failure detection method such as the gossip protocol, which works as follows:
•Each node maintains a node membership list, which contains member IDs and heartbeat counters.
•Each node periodically increments its heartbeat counter.
•Each node periodically sends heartbeats to a set of random nodes, which in turn propagate them to another set of nodes.
•Once nodes receive heartbeats, the membership list is updated with the latest info.
•If a heartbeat has not increased for more than a predefined period, the member is considered offline.
As shown in Figure 6-11:
•Node s0 maintains the node membership list shown on the left side.
•Node s0 notices that node s2's (member ID = 2) heartbeat counter has not increased for a long time.
•Node s0 sends heartbeats that include s2's info to a set of random nodes. Once other nodes confirm that s2's heartbeat counter has not been updated for a long time, node s2 is marked down, and this information is propagated to the other nodes.
After failures have been detected through the gossip protocol, the system needs to deploy certain mechanisms to ensure availability. In the strict quorum approach, read and write operations could be blocked, as illustrated in the quorum consensus section.
A technique called "sloppy quorum" [4] is used to improve availability. Instead of enforcing the quorum requirement, the system chooses the first W healthy servers for writes and the first R healthy servers for reads on the hash ring. Offline servers are ignored.
If a server is unavailable due to network or server failures, another server will process requests temporarily. When the down server comes back up, changes will be pushed back to it to achieve data consistency. This process is called hinted handoff. Since s2 is unavailable in Figure 6-12, reads and writes will be handled by s3 temporarily. When s2 comes back online, s3 will hand the data back to s2.
Hinted handoff is used to handle temporary failures. What if a replica is permanently unavailable? To handle such a situation, we implement an anti-entropy protocol to keep replicas in sync. Anti-entropy involves comparing each piece of data on the replicas and updating each replica to the newest version. A Merkle tree is used to detect inconsistencies and minimize the amount of data transferred.
Quoted from Wikipedia [7]: "A hash tree or Merkle tree is a tree in which every non-leaf node is labeled with the hash of the labels or values (in case of leaves) of its child nodes. Hash trees allow efficient and secure verification of the contents of large data structures".
Assuming the key space is from 1 to 12, the following steps show how to build a Merkle tree. Highlighted boxes indicate inconsistencies.
Step 1: Divide the key space into buckets (4 in our example), as shown in Figure 6-13. Buckets are used as root-level nodes to limit the depth of the tree.
Step 2: Once the buckets are created, hash each key in a bucket using a uniform hashing method (Figure 6-14).
Step 3: Create a single hash node per bucket (Figure 6-15).
Step 4: Build the tree upwards to the root by calculating the hashes of the children (Figure 6-16).
To compare two Merkle trees, start by comparing the root hashes. If the root hashes match, both servers have the same data. If the root hashes disagree, then the left child hashes are compared, followed by the right child hashes. You can traverse the tree to find which buckets are not synchronized and synchronize those buckets only.
Using Merkle trees, the amount of data that needs to be synchronized is proportional to the differences between the two replicas, not the amount of data they contain. In real-world systems, the bucket size is quite big. For instance, a possible configuration is one million buckets per one billion keys, so each bucket contains only 1,000 keys.
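The four steps above can be sketched as follows. This is a simplified illustration assuming the number of buckets is a power of two; a real replica compares trees level by level from the root downward, whereas for brevity this sketch jumps straight from a root mismatch to the bucket-level hashes.

```python
import hashlib

def h(data: str) -> str:
    return hashlib.sha256(data.encode()).hexdigest()

def build_tree(buckets):
    """Steps 2-4: hash each bucket's keys, then build parent hashes
    upward until a single root hash remains."""
    level = [h("".join(f"{k}={v}" for k, v in sorted(b.items()))) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels  # levels[0] = bucket hashes, levels[-1] = [root hash]

def diff_buckets(tree_a, tree_b):
    """Compare roots first; if they differ, return the out-of-sync buckets."""
    if tree_a[-1] == tree_b[-1]:   # root hashes match: replicas are in sync
        return []
    return [i for i, (x, y) in enumerate(zip(tree_a[0], tree_b[0])) if x != y]
```

Only the buckets returned by `diff_buckets` need to be synchronized, which is why the data transferred is proportional to the difference between replicas.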
A data center outage could happen due to a power outage, network outage, natural disaster, etc. To build a system capable of handling a data center outage, it is important to replicate data across multiple data centers. Even if a data center is completely offline, users can still access data through the other data centers.
Now that we have discussed the different technical considerations in designing a key-value store, we can shift our focus to the architecture diagram, shown in Figure 6-17.
The main features of the architecture are as follows:
•Clients communicate with the key-value store through simple APIs: get(key) and put(key, value).
•A coordinator is a node that acts as a proxy between the client and the key-value store.
•Nodes are distributed on a ring using consistent hashing.
•The system is completely decentralized, so adding and moving nodes can be automatic.
•Data is replicated at multiple nodes.
•There is no single point of failure, as every node has the same set of responsibilities.
As the design is decentralized, each node performs many tasks, as presented in Figure 6-18.
Figure 6-19 explains what happens after a write request is directed to a specific node. Please note that the proposed designs for the write/read paths are primarily based on the architecture of Cassandra [8].
1. The write request is persisted on a commit log file.
2. Data is saved in the memory cache.
3. When the memory cache is full or reaches a predefined threshold, data is flushed to an SSTable [9] on disk. Note: a sorted-string table (SSTable) is a sorted list of <key, value> pairs. For readers interested in learning more about SSTables, refer to the reference material [9].
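The three steps above can be sketched as a toy model. The `StorageNode` class and the `memtable_limit` threshold are illustrative assumptions, not Cassandra's actual implementation; in particular, real systems write the commit log to durable storage and use far larger flush thresholds.

```python
class StorageNode:
    """Toy write path: commit log -> memory cache -> flush to a sorted SSTable."""
    def __init__(self, memtable_limit=3):
        self.commit_log = []            # append-only durability record
        self.memtable = {}              # in-memory cache of recent writes
        self.sstables = []              # flushed, sorted key-value lists
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. persist to the commit log
        self.memtable[key] = value             # 2. save in the memory cache
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when threshold reached

    def flush(self):
        sstable = sorted(self.memtable.items())  # SSTables are sorted by key
        self.sstables.append(sstable)
        self.memtable = {}
```

After a flush, the memory cache is empty and the data lives on disk as a sorted run, which is what makes the later SSTable lookups efficient.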
After a read request is directed to a specific node, the node first checks if the data is in the memory cache. If so, the data is returned to the client, as shown in Figure 6-20.
If the data is not in memory, it will be retrieved from disk instead. We need an efficient way to find out which SSTable contains the key. A Bloom filter [10] is commonly used to solve this problem.
The read path when the data is not in memory is shown in Figure 6-21.
1. The system first checks if the data is in memory. If not, go to step 2.
2. If the data is not in memory, the system checks the Bloom filter.
3. The Bloom filter is used to figure out which SSTables might contain the key.
4. The SSTables return the result of the data set.
5. The result of the data set is returned to the client.
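The Bloom filter in step 2 can be sketched as follows. This is a minimal, assumed implementation (fixed bit-array size, k independent hashes derived from SHA-256): it can report false positives but never false negatives, so a negative answer lets the read path skip an SSTable entirely.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may yield false positives, never false negatives."""
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))
```

In this design, each SSTable keeps its own filter: on a read miss in memory, the node consults each filter and searches only the SSTables whose filters answer "might contain".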
Reference materials
[1] Amazon DynamoDB: https://aws.amazon.com/dynamodb/
[2] Memcached: https://memcached.org/
[3] Redis: https://redis.io/
[4] Dynamo: Amazon’s Highly Available Key-value Store: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
[5] Cassandra: https://cassandra.apache.org/
[6] Bigtable: A Distributed Storage System for Structured Data: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
[7] Merkle tree: https://en.wikipedia.org/wiki/Merkle_tree
[8] Cassandra architecture: https://cassandra.apache.org/doc/latest/architecture/
[9] SSTable: https://www.igvita.com/2012/02/06/sstable-and-log-structured-storage-leveldb/
[10] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
In this chapter, you are asked to design a unique ID generator in distributed systems. Your first thought might be to use a primary key with the auto_increment attribute in a traditional database. However, auto_increment does not work in a distributed environment because a single database server is not large enough, and generating unique IDs across multiple databases with minimal delay is challenging.
Here are a few examples of unique IDs:
Asking clarification questions is the first step to tackle any system design interview question. Here is an example of candidate-interviewer interaction:
Candidate: What are the characteristics of unique IDs?
Interviewer: IDs must be unique and sortable.
Candidate: For each new record, does the ID increment by 1?
Interviewer: The ID increments by time but not necessarily only by 1. IDs created in the evening are larger than those created in the morning on the same day.
Candidate: Do IDs only contain numerical values?
Interviewer: Yes, that is correct.
Candidate: What is the ID length requirement?
Interviewer: IDs should fit into 64 bits.
Candidate: What is the scale of the system?
Interviewer: The system should be able to generate 10,000 IDs per second.
Above are some of the sample questions you can ask your interviewer. It is important to understand the requirements and clarify ambiguities. For this interview question, the requirements are as follows:
•IDs must be unique.
•IDs are numerical values only.
•IDs fit into 64 bits.
•IDs are ordered by date.
•Ability to generate over 10,000 unique IDs per second.
Multiple options can be used to generate unique IDs in distributed systems. The options we considered are:
•Multi-master replication
•Universally unique identifier (UUID)
•Ticket server
•Twitter snowflake approach
Let us look at each of them: how they work and the pros/cons of each option.
As shown in Figure 7-2, the first approach is multi-master replication.
This approach uses the databases’ auto_increment feature. Instead of increasing the next ID by 1, we increase it by k, where k is the number of database servers in use. As illustrated in Figure 7-2, the next ID to be generated is equal to the previous ID in the same server plus 2. This solves some scalability issues because IDs can scale with the number of database servers. However, this strategy has some major drawbacks:
•Hard to scale with multiple data centers.
•IDs do not go up with time across multiple servers.
•It does not scale well when a server is added or removed.
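The increase-by-k scheme can be sketched in a few lines; `AutoIncrementServer` is an illustrative stand-in for a database's auto_increment feature, not a real database API. With k = 2, one server hands out odd IDs and the other even IDs, so the two never collide.

```python
class AutoIncrementServer:
    """Each of the k servers hands out IDs spaced k apart."""
    def __init__(self, start, k):
        self.next_id = start   # server i (1-based) starts at i
        self.k = k             # step equals the number of servers

    def generate(self):
        current = self.next_id
        self.next_id += self.k
        return current

# Two servers (k = 2): one produces odd IDs, the other even IDs.
server1 = AutoIncrementServer(start=1, k=2)
server2 = AutoIncrementServer(start=2, k=2)
```

The last drawback above is visible here: adding a third server would require re-seeding every server with a new start and step of 3.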
A UUID is another easy way to obtain unique IDs. A UUID is a 128-bit number used to identify information in computer systems. UUIDs have a very low probability of collision. Quoted from Wikipedia, “after generating 1 billion UUIDs every second for approximately 100 years would the probability of creating a single duplicate reach 50%” [1].
Here is an example of a UUID: 09c93e62-50b4-468d-bf8a-c07e1040bfb2. UUIDs can be generated independently without coordination between servers. Figure 7-3 presents the UUID design.
In this design, each web server contains an ID generator, and each web server is responsible for generating IDs independently.
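Generating such an ID requires nothing beyond the standard library; the snippet below uses Python's `uuid.uuid4()` (a random version-4 UUID) to show that each web server can mint IDs locally with no coordination.

```python
import uuid

# Each web server calls uuid4() locally; no coordination is required.
new_id = uuid.uuid4()
print(new_id)                            # e.g. 09c93e62-50b4-468d-bf8a-c07e1040bfb2
assert new_id.int.bit_length() <= 128    # a UUID is a 128-bit number
```

Note that the 128-bit width is exactly what conflicts with our 64-bit requirement, as listed in the cons below.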
Pros:
•Generating a UUID is simple. No coordination between servers is needed, so there will not be any synchronization issues.
•The system is easy to scale because each web server is responsible for generating the IDs it consumes. The ID generator can easily scale with the web servers.
Cons:
•IDs are 128 bits long, but our requirement is 64 bits.
•IDs do not go up with time.
•IDs could be non-numeric.
Ticket servers are another interesting way to generate unique IDs. Flickr developed ticket servers to generate distributed primary keys [2]. It is worth mentioning how the system works.
The idea is to use a centralized auto_increment feature in a single database server (the ticket server). To learn more about this, refer to Flickr’s engineering blog article [2].
Pros:
•Numeric IDs.
•It is easy to implement, and it works for small to medium-scale applications.
Cons:
•Single point of failure. A single ticket server means that if the ticket server goes down, all systems that depend on it will face issues. To avoid a single point of failure, we can set up multiple ticket servers. However, this will introduce new challenges such as data synchronization.
The approaches mentioned above give us some ideas about how different ID generation systems work. However, none of them meets our specific requirements; thus, we need another approach. Twitter’s unique ID generation system called “snowflake” [3] is inspiring and can satisfy our requirements.
Divide and conquer is our friend. Instead of generating an ID directly, we divide an ID into different sections. Figure 7-5 shows the layout of a 64-bit ID.
Each section is explained below.
•Sign bit: 1 bit. It will always be 0. This is reserved for future use. It could potentially be used to distinguish between signed and unsigned numbers.
•Timestamp: 41 bits. Milliseconds since the epoch or a custom epoch. We use the Twitter snowflake default epoch 1288834974657, equivalent to Nov 04, 2010, 01:42:54 UTC.
•Datacenter ID: 5 bits, which gives us 2 ^ 5 = 32 datacenters.
•Machine ID: 5 bits, which gives us 2 ^ 5 = 32 machines per datacenter.
•Sequence number: 12 bits. For every ID generated on that machine/process, the sequence number is incremented by 1. The number is reset to 0 every millisecond.
In the high-level design, we discussed various options for designing a unique ID generator in distributed systems. We settled on an approach based on the Twitter snowflake ID generator. Let us dive deep into the design. To refresh our memory, the design diagram is relisted below.
Datacenter IDs and machine IDs are chosen at startup time and are generally fixed once the system is up and running. Any changes to datacenter IDs and machine IDs require careful review since an accidental change in those values can lead to ID conflicts. Timestamps and sequence numbers are generated while the ID generator is running.
Timestamp
The most significant 41 bits make up the timestamp section. As timestamps grow with time, IDs are sortable by time. Figure 7-7 shows an example of how a binary representation is converted to UTC. You can also convert UTC back to the binary representation using a similar method.
The maximum timestamp that can be represented in 41 bits is 2 ^ 41 - 1 = 2199023255551 milliseconds (ms), which gives us about 69 years:
2199023255551 ms / 1000 (seconds) / 3600 (hours) / 24 (days) / 365 (years) ≈ 69 years.
This means the ID generator will work for 69 years, and using a custom epoch time close to today’s date delays the overflow. After 69 years, we will need a new epoch time or adopt other techniques to migrate the IDs.
Sequence number
The sequence number is 12 bits, which gives us 2 ^ 12 = 4096 combinations. This field is 0 unless more than one ID is generated in a millisecond on the same server. In theory, a machine can support a maximum of 4096 new IDs per millisecond.
In this chapter, we discussed different approaches to designing a unique ID generator: multi-master replication, UUID, ticket server, and a Twitter snowflake-like unique ID generator. We settled on snowflake as it supports all our use cases and is scalable in a distributed environment.
If there is extra time at the end of the interview, here are a few additional talking points:
•Clock synchronization. In our design, we assume ID generation servers have the same clock. This assumption might not be true when a server is running on multiple cores. The same challenge exists in multi-machine scenarios. Solutions to clock synchronization are out of the scope of this book; however, it is important to understand that the problem exists. Network Time Protocol is the most popular solution to this problem. Interested readers can refer to the reference material [4].
•Section length tuning. For example, fewer sequence number bits but more timestamp bits is effective for low-concurrency, long-running applications.
•High availability. Since an ID generator is a mission-critical system, it must be highly available.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Universally unique identifier: https://en.wikipedia.org/wiki/Universally_unique_identifier
[2] Ticket Servers: Distributed Unique Primary Keys on the Cheap: https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
[3] Announcing Snowflake: https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html
[4] Network Time Protocol: https://en.wikipedia.org/wiki/Network_Time_Protocol
In this chapter, we will tackle an interesting and classic system design interview question: designing a URL shortening service like tinyurl.
System design interview questions are intentionally left open-ended. To design a well-crafted system, it is critical to ask clarification questions.
Candidate: Can you give an example of how a URL shortener works?
Interviewer: Assume URL https://www.systeminterview.com/q=chatsystem&c=loggedin&v=v3&l=long is the original URL. Your service creates an alias with a shorter length: https://tinyurl.com/y7keocwj. If you click the alias, it redirects you to the original URL.
Candidate: What is the traffic volume?
Interviewer: 100 million URLs are generated per day.
Candidate: How long is the shortened URL?
Interviewer: As short as possible.
Candidate: What characters are allowed in the shortened URL?
Interviewer: A shortened URL can be a combination of numbers (0-9) and characters (a-z, A-Z).
Candidate: Can shortened URLs be deleted or updated?
Interviewer: For simplicity, let us assume shortened URLs cannot be deleted or updated.
Here are the basic use cases:
1. URL shortening: given a long URL => return a much shorter URL
2. URL redirecting: given a shorter URL => redirect to the original URL
3. High availability, scalability, and fault tolerance considerations
•Write operation: 100 million URLs are generated per day.
•Write operations per second: 100 million / 24 / 3600 = ~1160
•Read operation: assuming a 10:1 ratio of read operations to write operations, read operations per second: 1160 * 10 = 11,600
•Assuming the URL shortener service will run for 10 years, we must support 100 million * 365 * 10 = 365 billion records.
•Assume the average URL length is 100.
•Storage requirement over 10 years: 365 billion * 100 bytes = 36.5 TB
It is important for you to walk through the assumptions and calculations with your interviewer so that both of you are on the same page.
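As a sanity check, the arithmetic can be reproduced in a few lines. Note that the ten years are already folded into the 365 billion record count, so the storage figure works out to 36.5 TB:

```python
# Sanity-check arithmetic for the estimation above.
urls_per_day = 100_000_000
write_qps = urls_per_day / (24 * 3600)          # ~1160 writes per second
read_qps = write_qps * 10                       # 10:1 read/write ratio => ~11,600

records = urls_per_day * 365 * 10               # 365 billion records over 10 years
avg_url_bytes = 100
storage_tb = records * avg_url_bytes / 10**12   # 36.5 TB
```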
In this section, we discuss the API endpoints, URL redirecting, and URL shortening flows.
API endpoints facilitate the communication between clients and servers. We will design the APIs REST-style. If you are unfamiliar with RESTful APIs, you can consult external materials, such as the one in reference material [1]. A URL shortener primarily needs two API endpoints.
1. URL shortening. To create a new short URL, a client sends a POST request, which contains one parameter: the original long URL. The API looks like this:
POST api/v1/data/shorten
•request parameter: {longUrl: longURLString}
•return shortURL
2. URL redirecting. To redirect a short URL to the corresponding long URL, a client sends a GET request. The API looks like this:
GET api/v1/shortUrl
•return longURL for HTTP redirection
Figure 8-1 shows what happens when you enter a tinyurl into the browser. Once the server receives a tinyurl request, it changes the short URL to the long URL with a 301 redirect.
The detailed communication between clients and servers is shown in Figure 8-2.
One thing worth discussing here is 301 redirect vs 302 redirect.
301 redirect. A 301 redirect shows that the requested URL is “permanently” moved to the long URL. Since it is permanently redirected, the browser caches the response, and subsequent requests for the same URL will not be sent to the URL shortening service. Instead, requests are redirected to the long URL server directly.
302 redirect. A 302 redirect means that the URL is “temporarily” moved to the long URL, meaning that subsequent requests for the same URL will be sent to the URL shortening service first. Then, they are redirected to the long URL server.
Each redirection method has its pros and cons. If the priority is to reduce the server load, using a 301 redirect makes sense as only the first request for the same URL is sent to URL shortening servers. However, if analytics is important, a 302 redirect is a better choice as it can track the click rate and source of the click more easily.
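To make the trade-off concrete, a redirect handler only has to vary the status code. The sketch below is illustrative: the in-memory table and function names are stand-ins, not part of the design above, and a real service would look the mapping up in a cache or database.

```python
# Hypothetical in-memory mapping; a real service would query a cache/database.
URL_TABLE = {"zn9edcu": "https://en.wikipedia.org/wiki/Systems_design"}

def redirect_response(short_path, permanent=False):
    """Build the status code and headers for a redirect.

    301 (permanent) lets browsers cache the mapping, reducing server load;
    302 (temporary) forces every request through our service, which helps analytics.
    """
    long_url = URL_TABLE.get(short_path)
    if long_url is None:
        return 404, {}
    status = 301 if permanent else 302
    return status, {"Location": long_url}
```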
The most intuitive way to implement URL redirecting is to use hash tables. Assuming the hash table stores <shortURL, longURL> pairs, URL redirecting can be implemented as follows:
•Get longURL: longURL = hashTable.get(shortURL)
•Once you get the longURL, perform the URL redirect.
Let us assume the short URL looks like this: www.tinyurl.com/{hashValue}. To support the URL shortening use case, we must find a hash function fx that maps a long URL to the hashValue, as shown in Figure 8-3.
The hash function must satisfy the following requirements:
•Each longURL must be hashed to one hashValue.
•Each hashValue can be mapped back to the longURL.
Detailed design for the hash function is discussed in the deep dive.
Up until now, we have discussed the high-level design of URL shortening and URL redirecting. In this section, we dive deep into the following: data model, hash function, URL shortening, and URL redirecting.
In the high-level design, everything is stored in a hash table. This is a good starting point; however, this approach is not feasible for real-world systems as memory resources are limited and expensive. A better option is to store the <shortURL, longURL> mapping in a relational database. Figure 8-4 shows a simple database table design. The simplified version of the table contains 3 columns: id, shortURL, longURL.
A hash function is used to hash a long URL to a short URL, also known as hashValue.
The hashValue consists of characters from [0-9, a-z, A-Z], containing 10 + 26 + 26 = 62 possible characters. To figure out the length of hashValue, find the smallest n such that 62^n ≥ 365 billion. The system must support up to 365 billion URLs based on the back-of-the-envelope estimation. Table 8-1 shows the length of hashValue and the corresponding maximal number of URLs it can support.
When n = 7, 62^7 = ~3.5 trillion, which is more than enough to hold 365 billion URLs, so the length of hashValue is 7.
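The length calculation above can be checked with a short loop:

```python
# Find the smallest n such that 62^n covers the 365 billion URLs
# from the back-of-the-envelope estimation.
TARGET = 365 * 10**9

n = 1
while 62**n < TARGET:
    n += 1

print(n)      # 7
print(62**7)  # 3521614606208, i.e. ~3.5 trillion
```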
We will explore two types of hash functions for a URL shortener. The first one is “hash + collision resolution”, and the second one is “base 62 conversion.” Let us look at them one by one.
To shorten a long URL, we should implement a hash function that hashes a long URL to a 7-character string. A straightforward solution is to use well-known hash functions like CRC32, MD5, or SHA-1. The following table compares the hash results after applying different hash functions to this URL: https://en.wikipedia.org/wiki/Systems_design.
As shown in Table 8-2, even the shortest hash value (from CRC32) is too long (more than 7 characters). How can we make it shorter?
The first approach is to collect the first 7 characters of a hash value; however, this method can lead to hash collisions. To resolve hash collisions, we can recursively append a new predefined string until no more collision is discovered. This process is explained in Figure 8-5.
This method can eliminate collisions; however, it is expensive to query the database to check if a shortURL exists for every request. A technique called bloom filters [2] can improve performance. A bloom filter is a space-efficient probabilistic technique to test if an element is a member of a set. Refer to reference material [2] for more details.
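A minimal sketch of "hash + collision resolution", assuming an in-memory set stands in for the database lookup and "x" is the predefined string appended on collision:

```python
import hashlib

# In-memory stand-in for the <shortURL, longURL> database.
seen = set()

def shorten(long_url):
    candidate = long_url
    while True:
        digest = hashlib.md5(candidate.encode()).hexdigest()
        short = digest[:7]        # collect the first 7 characters of the hash
        if short not in seen:
            seen.add(short)
            return short
        candidate += "x"          # collision: append the predefined string and retry
```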
Base conversion is another approach commonly used for URL shorteners. Base conversion helps to convert the same number between different number representation systems. Base 62 conversion is used as there are 62 possible characters for hashValue. Let us use an example to explain how the conversion works: convert 11157 (base 10) to its base 62 representation.
•From its name, base 62 is a way of using 62 characters for encoding. The mappings are: 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z, where ‘a’ stands for 10, ‘Z’ stands for 61, etc.
•11157 = 2 x 62^2 + 55 x 62^1 + 59 x 62^0 = [2, 55, 59] -> [2, T, X] in base 62 representation. Figure 8-6 shows the conversion process.
•Thus, the short URL is https://tinyurl.com/2TX
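The conversion above can be sketched as follows, using the 0-9, a-z, A-Z mapping described earlier:

```python
import string

# Index 0 -> '0', 10 -> 'a', 36 -> 'A', 61 -> 'Z', matching the mapping above.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(num):
    if num == 0:
        return ALPHABET[0]
    digits = []
    while num > 0:
        num, rem = divmod(num, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))  # most significant digit first

print(base62_encode(11157))  # 2TX
```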
Table 8-3 shows the differences between the two approaches.
As one of the core pieces of the system, we want the URL shortening flow to be logically simple and functional. Base 62 conversion is used in our design. We build the following diagram (Figure 8-7) to demonstrate the flow.
1. longURL is the input.
2. The system checks if the longURL is in the database.
3. If it is, it means the longURL was converted to a shortURL before. In this case, fetch the shortURL from the database and return it to the client.
4. If not, the longURL is new. A new unique ID (primary key) is generated by the unique ID generator.
5. Convert the ID to a shortURL with base 62 conversion.
6. Create a new database row with the ID, shortURL, and longURL.
To make the flow easier to understand, let us look at a concrete example.
•Assuming the input longURL is: https://en.wikipedia.org/wiki/Systems_design
•Unique ID generator returns ID: 2009215674938.
•Convert the ID to a shortURL using base 62 conversion. ID (2009215674938) is converted to “zn9edcu”.
•Save the ID, shortURL, and longURL to the database as shown in Table 8-4.
The distributed unique ID generator is worth mentioning. Its primary function is to generate globally unique IDs, which are used for creating shortURLs. In a highly distributed environment, implementing a unique ID generator is challenging. Luckily, we have already discussed a few solutions in “Chapter 7: Design A Unique ID Generator in Distributed Systems”. You can refer back to it to refresh your memory.
Figure 8-8 shows the detailed design of URL redirecting. As there are more reads than writes, the <shortURL, longURL> mapping is stored in a cache to improve performance.
The flow of URL redirecting is summarized as follows:
1. A user clicks a short URL link: https://tinyurl.com/zn9edcu
2. The load balancer forwards the request to web servers.
3. If a shortURL is already in the cache, return the longURL directly.
4. If a shortURL is not in the cache, fetch the longURL from the database. If it is not in the database, it is likely that a user entered an invalid shortURL.
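The read path above can be sketched as follows, assuming plain dicts stand in for the cache and the database (in production these would be, say, Redis and a relational database):

```python
cache = {}      # stand-in for the cache layer
database = {"zn9edcu": "https://en.wikipedia.org/wiki/Systems_design"}

def resolve(short_url):
    long_url = cache.get(short_url)            # step 3: cache hit returns directly
    if long_url is None:
        long_url = database.get(short_url)     # step 4: fall back to the database
        if long_url is None:
            return None                        # invalid shortURL
        cache[short_url] = long_url            # populate the cache for later reads
    return long_url
```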
In this chapter, we talked about the API design, data model, hash function, URL shortening, and URL redirecting.
If there is extra time at the end of the interview, here are a few additional talking points.
•Rate limiter: A potential security problem we could face is that malicious users send an overwhelmingly large number of URL shortening requests. A rate limiter helps to filter out requests based on IP address or other filtering rules. If you want to refresh your memory about rate limiting, refer to “Chapter 4: Design a Rate Limiter”.
•Web server scaling: Since the web tier is stateless, it is easy to scale the web tier by adding or removing web servers.
•Database scaling: Database replication and sharding are common techniques.
•Analytics: Data is increasingly important for business success. Integrating an analytics solution into the URL shortener could help to answer important questions like how many people click on a link, when they click the link, etc.
•Availability, consistency, and reliability. These concepts are at the core of any large system’s success. We discussed them in detail in Chapter 1; please refresh your memory on these topics.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] A RESTful Tutorial: https://www.restapitutorial.com/index.html
[2] Bloom filter: https://en.wikipedia.org/wiki/Bloom_filter
In this chapter, we focus on web crawler design: an interesting and classic system design interview question.
A web crawler is known as a robot or spider. It is widely used by search engines to discover new or updated content on the web. Content can be a web page, an image, a video, a PDF file, etc. A web crawler starts by collecting a few web pages and then follows links on those pages to collect new content. Figure 9-1 shows a visual example of the crawl process.
A crawler is used for many purposes:
•Search engine indexing: This is the most common use case. A crawler collects web pages to create a local index for search engines. For example, Googlebot is the web crawler behind the Google search engine.
•Web archiving: This is the process of collecting information from the web to preserve data for future uses. For instance, many national libraries run crawlers to archive web sites. Notable examples are the US Library of Congress [1] and the EU web archive [2].
•Web mining: The explosive growth of the web presents an unprecedented opportunity for data mining. Web mining helps to discover useful knowledge from the internet. For example, top financial firms use crawlers to download shareholder meetings and annual reports to learn key company initiatives.
•Web monitoring: Crawlers help to monitor copyright and trademark infringements over the Internet. For example, Digimarc [3] utilizes crawlers to discover pirated works and reports.
The complexity of developing a web crawler depends on the scale we intend to support. It could be either a small school project, which takes only a few hours to complete, or a gigantic project that requires continuous improvement from a dedicated engineering team. Thus, we will explore the scale and features to support below.
The basic algorithm of a web crawler is simple:
1. Given a set of URLs, download all the web pages addressed by the URLs.
2. Extract URLs from these web pages.
3. Add new URLs to the list of URLs to be downloaded. Repeat these 3 steps.
Does a web crawler really work as simply as this basic algorithm? Not exactly. Designing a vastly scalable web crawler is an extremely complex task. It is unlikely for anyone to design a massive web crawler within the interview duration. Before jumping into the design, we must ask questions to understand the requirements and establish the design scope:
Candidate: What is the main purpose of the crawler? Is it used for search engine indexing, data mining, or something else?
Interviewer: Search engine indexing.
Candidate: How many web pages does the web crawler collect per month?
Interviewer: 1 billion pages.
Candidate: What content types are included? HTML only, or other content types such as PDFs and images as well?
Interviewer: HTML only.
Candidate: Shall we consider newly added or edited web pages?
Interviewer: Yes, we should consider the newly added or edited web pages.
Candidate: Do we need to store HTML pages crawled from the web?
Interviewer: Yes, for up to 5 years.
Candidate: How do we handle web pages with duplicate content?
Interviewer: Pages with duplicate content should be ignored.
Above are some of the sample questions that you can ask your interviewer. It is important to understand the requirements and clarify ambiguities. Even if you are asked to design a straightforward product like a web crawler, you and your interviewer might not have the same assumptions.
Besides the functionalities to clarify with your interviewer, it is also important to note down the following characteristics of a good web crawler:
•Scalability: The web is very large. There are billions of web pages out there. Web crawling should be extremely efficient using parallelization.
•Robustness: The web is full of traps. Bad HTML, unresponsive servers, crashes, malicious links, etc. are all common. The crawler must handle all those edge cases.
•Politeness: The crawler should not make too many requests to a website within a short time interval.
•Extensibility: The system is flexible so that minimal changes are needed to support new content types. For example, if we want to crawl image files in the future, we should not need to redesign the entire system.
The following estimations are based on many assumptions, and it is important to communicate with the interviewer to be on the same page.
•Assume 1 billion web pages are downloaded every month.
•QPS: 1,000,000,000 / 30 days / 24 hours / 3600 seconds = ~400 pages per second.
•Peak QPS = 2 * QPS = 800
•Assume the average web page size is 500 KB.
•1 billion pages x 500 KB = 500 TB storage per month. If you are unclear about digital storage units, go through the “Power of 2” section in Chapter 2 again.
•Assuming data are stored for five years, 500 TB * 12 months * 5 years = 30 PB. A 30 PB storage is needed to store five years of content.
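Reproducing the arithmetic (the exact QPS is ~386, which rounds to the ~400 quoted above):

```python
# Back-of-the-envelope numbers for the crawler.
pages_per_month = 1_000_000_000
qps = pages_per_month / (30 * 24 * 3600)    # ~386, i.e. roughly 400 pages/sec
peak_qps = 2 * qps                          # assume peak is double the average

page_size_kb = 500
tb_per_month = pages_per_month * page_size_kb / 10**9   # 500 TB (10^9 KB per TB)
pb_five_years = tb_per_month * 12 * 5 / 1000            # 30 PB over five years
```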
Once the requirements are clear, we move on to the high-level design. Inspired by previous studies on web crawling [4] [5], we propose a high-level design as shown in Figure 9-2.
First, we explore each design component to understand its functionality. Then, we examine the crawler workflow step-by-step.
Seed URLs
A web crawler uses seed URLs as a starting point for the crawl process. For example, to crawl all web pages from a university’s website, an intuitive way to select seed URLs is to use the university’s domain name.
To crawl the entire web, we need to be creative in selecting seed URLs. A good seed URL serves as a good starting point that a crawler can utilize to traverse as many links as possible. The general strategy is to divide the entire URL space into smaller ones. The first proposed approach is based on locality, as different countries may have different popular websites. Another way is to choose seed URLs based on topics; for example, we can divide the URL space into shopping, sports, healthcare, etc. Seed URL selection is an open-ended question. You are not expected to give the perfect answer. Just think out loud.
URL Frontier
Most modern web crawlers split the crawl state into two: to be downloaded and already downloaded. The component that stores URLs to be downloaded is called the URL Frontier. You can think of it as a First-in-First-out (FIFO) queue. For detailed information about the URL Frontier, refer to the deep dive.
HTML Downloader
The HTML downloader downloads web pages from the internet. Those URLs are provided by the URL Frontier.
DNS Resolver
To download a web page, a URL must be translated into an IP address. The HTML Downloader calls the DNS Resolver to get the corresponding IP address for the URL. For instance, the URL www.wikipedia.org is converted to IP address 198.35.26.96 as of 3/5/2019.
Content Parser
After a web page is downloaded, it must be parsed and validated because malformed web pages could provoke problems and waste storage space. Implementing a content parser in a crawl server would slow down the crawling process. Thus, the content parser is a separate component.
Content Seen?
Online research [6] reveals that 29% of web pages are duplicated content, which may cause the same content to be stored multiple times. We introduce the “Content Seen?” data structure to eliminate data redundancy and shorten processing time. It helps to detect whether content was previously stored in the system. To compare two HTML documents, we can compare them character by character. However, this method is slow and time-consuming, especially when billions of web pages are involved. An efficient way to accomplish this task is to compare the hash values of the two web pages [7].
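A sketch of the hash-comparison idea, assuming an in-memory set of page hashes; at billions of pages this set would live in a distributed store or a bloom filter:

```python
import hashlib

seen_hashes = set()   # stand-in for the "Content Seen?" storage

def content_seen(html):
    # Compare fixed-size hashes instead of full documents.
    digest = hashlib.sha256(html.encode()).hexdigest()
    if digest in seen_hashes:
        return True           # duplicate content: the page should be discarded
    seen_hashes.add(digest)
    return False
```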
Content Storage
It is a storage system for storing HTML content. The choice of storage system depends on factors such as data type, data size, access frequency, life span, etc. Both disk and memory are used.
•Most of the content is stored on disk because the data set is too big to fit in memory.
•Popular content is kept in memory to reduce latency.
URL Extractor
The URL Extractor parses and extracts links from HTML pages. Figure 9-3 shows an example of a link extraction process. Relative paths are converted to absolute URLs by adding the “https://en.wikipedia.org” prefix.
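A minimal link extractor using only the standard library is sketched below; production crawlers typically use a more robust HTML parser:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # urljoin converts relative paths to absolute URLs
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://en.wikipedia.org")
extractor.feed('<a href="/wiki/Systems_design">Systems design</a>')
print(extractor.links)  # ['https://en.wikipedia.org/wiki/Systems_design']
```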
URL Filter
The URL filter excludes certain content types, file extensions, error links, and URLs in “blacklisted” sites.
URL Seen?
“URL Seen?” is a data structure that keeps track of URLs that have been visited before or are already in the Frontier. “URL Seen?” helps to avoid adding the same URL multiple times, as this can increase server load and cause potential infinite loops.
Bloom filters and hash tables are common techniques to implement the “URL Seen?” component. We will not cover the detailed implementation of the bloom filter and hash table here. For more information, refer to the reference materials [4] [8].
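To make the bloom-filter idea concrete, here is a toy version with a small fixed bit array and two hash functions; real deployments size both from the expected URL count and the target false-positive rate:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024):
        self.size = size
        self.bits = [False] * size

    def _positions(self, url):
        # Derive two bit positions from two different hash functions.
        for algo in ("md5", "sha1"):
            digest = hashlib.new(algo, url.encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos] = True

    def might_contain(self, url):
        # False means definitely unseen; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(url))
```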
URL Storage
URL Storage stores already visited URLs.
So far, we have discussed every system component. Next, we put them together to explain the workflow.
Web crawler workflow
To better explain the workflow step-by-step, sequence numbers are added to the design diagram as shown in Figure 9-4.
Step 1: Add seed URLs to the URL Frontier.
Step 2: HTML Downloader fetches a list of URLs from the URL Frontier.
Step 3: HTML Downloader gets IP addresses of URLs from the DNS resolver and starts downloading.
Step 4: Content Parser parses HTML pages and checks if pages are malformed.
Step 5: After content is parsed and validated, it is passed to the “Content Seen?” component.
Step 6: The “Content Seen?” component checks if an HTML page is already in the storage.
•If it is in the storage, this means the same content in a different URL has already been processed. In this case, the HTML page is discarded.
•If it is not in the storage, the system has not processed the same content before. The content is passed to the Link Extractor.
Step 7: Link Extractor extracts links from HTML pages.
Step 8: Extracted links are passed to the URL filter.
Step 9: After links are filtered, they are passed to the “URL Seen?” component.
Step 10: The “URL Seen?” component checks if a URL is already in the storage; if yes, it has been processed before, and nothing needs to be done.
Step 11: If a URL has not been processed before, it is added to the URL Frontier.
Up until now, we have discussed the high-level design. Next, we will discuss the most important building components and techniques in depth:
•Depth-first search (DFS) vs Breadth-first search (BFS)
•URL frontier
•HTML Downloader
•Robustness
•Extensibility
•Detect and avoid problematic content
You can think of the web as a directed graph where web pages serve as nodes and hyperlinks (URLs) as edges. The crawl process can be seen as traversing a directed graph from one web page to others. Two common graph traversal algorithms are DFS and BFS. However, DFS is usually not a good choice because the depth of DFS can be very deep.
BFS is commonly used by web crawlers and is implemented by a first-in-first-out (FIFO) queue. In a FIFO queue, URLs are dequeued in the order they are enqueued. However, this implementation has two problems:
•Most links from the same web page link back to the same host. In Figure 9-5, all the links in wikipedia.com are internal links, making the crawler busy processing URLs from the same host (wikipedia.com). When the crawler tries to download web pages in parallel, Wikipedia servers will be flooded with requests. This is considered “impolite”.
•Standard BFS does not take the priority of a URL into consideration. The web is large and not every page has the same level of quality and importance. Therefore, we may want to prioritize URLs according to their page ranks, web traffic, update frequency, etc.
The URL frontier helps to address these problems. A URL frontier is a data structure that stores URLs to be downloaded. The URL frontier is an important component to ensure politeness, URL prioritization, and freshness. A few noteworthy papers on the URL frontier are mentioned in the reference materials [5] [9]. The findings from these papers are as follows:
Generally, a web crawler should avoid sending too many requests to the same hosting server within a short period. Sending too many requests is considered “impolite” or even treated as a denial-of-service (DoS) attack. For example, without any constraint, the crawler can send thousands of requests every second to the same website. This can overwhelm the web servers.
强制礼貌的总体思路是从同一主机一次下载一页。可以在两个下载任务之间添加延迟。礼貌约束是通过维护从网站主机名到下载(工作)线程的映射来实现的。每个下载器线程都有一个单独的 FIFO 队列,并且仅下载从该队列获取的 URL。图 9-6 显示了管理礼貌的设计。
The general idea of enforcing politeness is to download one page at a time from the same host. A delay can be added between two download tasks. The politeness constraint is implemented by maintain a mapping from website hostnames to download (worker) threads. Each downloader thread has a separate FIFO queue and only downloads URLs obtained from that queue. Figure 9-6 shows the design that manages politeness.
•Queue router: It ensures that each queue (b1, b2, … bn) only contains URLs from the same host.
•Mapping table: It maps each host to a queue.
•FIFO queues b1, b2 to bn: Each queue contains URLs from the same host.
•Queue selector: Each worker thread is mapped to a FIFO queue, and it only downloads URLs from that queue. The queue selection logic is done by the Queue selector.
•Worker threads 1 to N: A worker thread downloads web pages one by one from the same host. A delay can be added between two download tasks.
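The mapping-table and queue-router pieces above can be sketched in a few lines of Python (class and method names here are illustrative, not from the chapter):

```python
import collections
from urllib.parse import urlparse

class PolitenessRouter:
    """Routes each URL to a back queue dedicated to its host.
    This is a sketch: a real frontier reuses queues freed by finished
    hosts instead of assigning them round-robin forever."""

    def __init__(self, num_queues):
        self.queues = [collections.deque() for _ in range(num_queues)]
        self.host_to_queue = {}          # the "mapping table"
        self.next_free = 0

    def enqueue(self, url):
        host = urlparse(url).hostname
        if host not in self.host_to_queue:
            # Assign each new host to a queue round-robin.
            self.host_to_queue[host] = self.next_free % len(self.queues)
            self.next_free += 1
        self.queues[self.host_to_queue[host]].append(url)

router = PolitenessRouter(num_queues=2)
router.enqueue("https://en.wikipedia.org/wiki/A")
router.enqueue("https://en.wikipedia.org/wiki/B")
router.enqueue("https://example.com/")
```

Each worker thread then drains a single queue, sleeping between two downloads, so no host is ever hit by more than one thread at a time.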
A random post from a discussion forum about Apple products carries a very different weight than a post on the Apple home page. Even though they both contain the "Apple" keyword, it is sensible for a crawler to crawl the Apple home page first.
We prioritize URLs based on usefulness, which can be measured by PageRank [10], website traffic, update frequency, etc. "Prioritizer" is the component that handles URL prioritization. Refer to the reference materials [5] [10] for in-depth information about this concept.
Figure 9-7 shows the design that manages URL priority.
•Prioritizer: It takes URLs as input and computes the priorities.
•Queues f1 to fn: Each queue has an assigned priority. Queues with high priority are selected with higher probability.
•Queue selector: It randomly chooses a queue with a bias towards queues with higher priority.
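A minimal sketch of this biased selection, using weighted random choice (the weights and function names are our own assumptions, not the book's exact design):

```python
import random

def pick_queue(queues, weights, rng=random.Random(42)):
    """Pick a front-queue index with probability proportional to its
    priority weight. A fixed seed keeps this sketch reproducible."""
    return rng.choices(range(len(queues)), weights=weights, k=1)[0]

# Three front queues f1..f3; f1 has the highest priority.
queues = [["urlA"], ["urlB"], ["urlC"]]
weights = [5, 3, 1]
picks = [pick_queue(queues, weights) for _ in range(1000)]
```

Over many selections, the high-priority queue f1 is chosen most often, but low-priority queues are never starved entirely.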
Figure 9-8 presents the URL frontier design, which contains two modules:
•Front queues: manage prioritization
•Back queues: manage politeness
Web pages are constantly being added, deleted, and edited. A web crawler must periodically recrawl downloaded pages to keep our data set fresh. Recrawling all the URLs is time-consuming and resource-intensive. A few strategies to optimize freshness are listed as follows:
•Recrawl based on web pages' update history.
•Prioritize URLs and recrawl important pages first and more frequently.
In a real-world crawl for search engines, the number of URLs in the frontier could be hundreds of millions [4]. Putting everything in memory is neither durable nor scalable. Keeping everything on disk is undesirable either, because disk access is slow and can easily become a bottleneck for the crawl.
We adopt a hybrid approach. The majority of URLs are stored on disk, so storage space is not a problem. To reduce the cost of reading from and writing to the disk, we maintain buffers in memory for enqueue/dequeue operations. Data in the buffer is periodically written to the disk.
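A simplified sketch of this hybrid approach, with an in-memory buffer that spills to an append-only file once it fills up (real frontiers use more careful on-disk structures than a flat file):

```python
import collections
import os
import tempfile

class BufferedURLQueue:
    """Keeps a small in-memory buffer for enqueue operations and
    periodically persists it to disk (sketch; names are our own)."""

    def __init__(self, path, buffer_size=2):
        self.path = path
        self.buffer = collections.deque()
        self.buffer_size = buffer_size

    def enqueue(self, url):
        self.buffer.append(url)
        if len(self.buffer) > self.buffer_size:
            self.flush()

    def flush(self):
        # Write buffered URLs to the on-disk frontier in one batch,
        # amortizing the cost of disk writes.
        with open(self.path, "a") as f:
            while self.buffer:
                f.write(self.buffer.popleft() + "\n")

path = os.path.join(tempfile.mkdtemp(), "frontier.txt")
q = BufferedURLQueue(path)
for u in ["u1", "u2", "u3", "u4"]:
    q.enqueue(u)
```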
The HTML Downloader downloads web pages from the internet using the HTTP protocol. Before discussing the HTML Downloader, we look at the Robots Exclusion Protocol first.
Robots.txt
Robots.txt, formally the Robots Exclusion Protocol, is a standard used by websites to communicate with crawlers. It specifies which pages crawlers are allowed to download. Before attempting to crawl a web site, a crawler should check its corresponding robots.txt first and follow its rules.
To avoid repeatedly downloading the robots.txt file, we cache its results. The file is downloaded and saved to the cache periodically. Here is a piece of the robots.txt file taken from https://www.amazon.com/robots.txt. Some of the directories, like creatorhub, are disallowed for the Google bot.
User-agent: Googlebot
Disallow: /creatorhub/*
Disallow: /rss/people/*/reviews
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
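Python's standard library can parse and evaluate such rules. The sketch below feeds a trimmed version of the rules above into `urllib.robotparser`; the wildcard entries are simplified to plain path prefixes here because the stdlib parser matches prefixes only:

```python
from urllib import robotparser

# A trimmed, wildcard-free version of the Amazon rules quoted above.
# We parse the text directly instead of fetching it over the network.
rules = """\
User-agent: Googlebot
Disallow: /creatorhub/
Disallow: /gp/aw/cr/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A crawler consults the parsed rules before downloading each URL.
allowed = parser.can_fetch("Googlebot", "https://www.amazon.com/gp/help")
blocked = parser.can_fetch("Googlebot", "https://www.amazon.com/creatorhub/page")
```

In a real crawler, the parsed rules per host would live in the robots.txt cache described above and be refreshed periodically.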
Besides robots.txt, performance optimization is another important concept we will cover for the HTML downloader.
Below is a list of performance optimizations for the HTML downloader.
1. Distributed crawl
To achieve high performance, crawl jobs are distributed across multiple servers, and each server runs multiple threads. The URL space is partitioned into smaller pieces, so each downloader is responsible for a subset of the URLs. Figure 9-9 shows an example of a distributed crawl.
2. Cache DNS Resolver
The DNS Resolver is a bottleneck for crawlers because DNS requests can take time due to the synchronous nature of many DNS interfaces. DNS response time ranges from 10ms to 200ms. Once a request to DNS is carried out by a crawler thread, other threads are blocked until the first request is completed. Maintaining our own DNS cache to avoid calling DNS frequently is an effective technique for speed optimization. Our DNS cache keeps the domain name to IP address mapping and is updated periodically by cron jobs.
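A minimal sketch of such a cache with lazy TTL-based refresh (the resolver hook and names are illustrative; a production cache would also be refreshed by cron jobs as described):

```python
import time

class DNSCache:
    """Maps hostnames to IPs with a TTL so repeated lookups for the
    same host skip the slow, synchronous DNS round trip."""

    def __init__(self, resolver, ttl_seconds=300, clock=time.monotonic):
        self.resolver = resolver          # e.g. socket.gethostbyname
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}                 # host -> (ip, expires_at)
        self.misses = 0

    def resolve(self, host):
        entry = self.entries.get(host)
        if entry and entry[1] > self.clock():
            return entry[0]               # cache hit: no DNS call
        self.misses += 1
        ip = self.resolver(host)
        self.entries[host] = (ip, self.clock() + self.ttl)
        return ip

# A stub resolver stands in for a real DNS lookup in this sketch.
cache = DNSCache(resolver=lambda host: "93.184.216.34")
ip1 = cache.resolve("example.com")
ip2 = cache.resolve("example.com")       # served from cache
```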
3. Locality
Distribute crawl servers geographically. When crawl servers are closer to website hosts, crawlers experience faster download times. Design locality applies to most of the system components: crawl servers, cache, queue, storage, etc.
4. Short timeout
Some web servers respond slowly or may not respond at all. To avoid long wait times, a maximal wait time is specified. If a host does not respond within a predefined time, the crawler will stop the job and crawl some other pages.
Besides performance optimization, robustness is also an important consideration. We present a few approaches to improve system robustness:
•Consistent hashing: This helps to distribute the load among downloaders. A new downloader server can be added or removed using consistent hashing. Refer to Chapter 5: Design Consistent Hashing for more details.
•Save crawl states and data: To guard against failures, crawl states and data are written to a storage system. A disrupted crawl can be restarted easily by loading the saved states and data.
•Exception handling: Errors are inevitable and common in a large-scale system. The crawler must handle exceptions gracefully without crashing the system.
•Data validation: This is an important measure to prevent system errors.
As almost every system evolves, one of the design goals is to make the system flexible enough to support new content types. The crawler can be extended by plugging in new modules. Figure 9-10 shows how to add new modules:
•The PNG Downloader module is plugged in to download PNG files.
•The Web Monitor module is added to monitor the web and prevent copyright and trademark infringements.
This section discusses the detection and prevention of redundant, meaningless, or harmful content.
1. Redundant content
As discussed previously, nearly 30% of the web pages are duplicates. Hashes or checksums help to detect duplication [11].
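A sketch of this "content seen?" check using SHA-256 digests (a real crawler would keep digests in a shared data store rather than a local set):

```python
import hashlib

def content_digest(html: str) -> str:
    """Checksum of a page body; identical pages produce identical digests."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

seen_digests = set()

def is_duplicate(html: str) -> bool:
    """Compare compact digests instead of comparing full documents
    character by character."""
    digest = content_digest(html)
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False

first = is_duplicate("<html>same body</html>")
second = is_duplicate("<html>same body</html>")
```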
2. Spider traps
A spider trap is a web page that causes a crawler to get stuck in an infinite loop. For instance, an infinitely deep directory structure is listed as follows: www.spidertrapexample.com/foo/bar/foo/bar/foo/bar/…
Such spider traps can be avoided by setting a maximal length for URLs. However, no one-size-fits-all solution exists to detect spider traps. Websites containing spider traps are easy to identify due to the unusually large number of web pages discovered on such websites. It is hard to develop automatic algorithms to avoid spider traps; however, a user can manually verify and identify a spider trap, and either exclude those websites from the crawler or apply some customized URL filters.
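A crude guard along these lines, combining a maximal URL length with a repeated-segment heuristic (both the threshold and the heuristic are illustrative assumptions, not a universal solution):

```python
MAX_URL_LENGTH = 200  # illustrative threshold; tune per deployment

def looks_like_trap(url: str, max_len: int = MAX_URL_LENGTH) -> bool:
    """Reject overly long URLs and obvious repeating path segments
    (foo/bar/foo/bar/...). Heuristics only: no universal test exists."""
    if len(url) > max_len:
        return True
    segments = [s for s in url.split("/") if s]
    # The same segment repeated many times is suspicious.
    return any(segments.count(s) > 3 for s in set(segments))

ok = looks_like_trap("https://example.com/foo/bar")
trap = looks_like_trap("https://example.com/" + "foo/bar/" * 10)
```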
3. Data noise
Some of the content has little or no value, such as advertisements, code snippets, spam URLs, etc. That content is not useful for crawlers and should be excluded if possible.
In this chapter, we first discussed the characteristics of a good crawler: scalability, politeness, extensibility, and robustness. Then, we proposed a design and discussed key components. Building a scalable web crawler is not a trivial task because the web is enormously large and full of traps. Even though we have covered many topics, we have still missed many relevant talking points:
•Server-side rendering: Numerous websites use scripts like JavaScript, AJAX, etc., to generate links on the fly. If we download and parse web pages directly, we will not be able to retrieve dynamically generated links. To solve this problem, we perform server-side rendering (also called dynamic rendering) first before parsing a page [12].
•Filter out unwanted pages: With finite storage capacity and crawl resources, an anti-spam component is beneficial in filtering out low-quality and spam pages [13] [14].
•Database replication and sharding: Techniques like replication and sharding are used to improve data layer availability, scalability, and reliability.
•Horizontal scaling: For a large-scale crawl, hundreds or even thousands of servers are needed to perform download tasks. The key is to keep servers stateless.
•Availability, consistency, and reliability: These concepts are at the core of any large system's success. We discussed these concepts in detail in Chapter 1. Refresh your memory on these topics.
•Analytics: Collecting and analyzing data are important parts of any system because data is a key ingredient for fine-tuning.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] US Library of Congress: https://www.loc.gov/websites/
[2] EU Web Archive: http://data.europa.eu/webarchive
[3] Digimarc: https://www.digimarc.com/products/digimarc-services/piracy-intelligence
[4] Heydon A., Najork M., "Mercator: A scalable, extensible web crawler," World Wide Web, 2 (4) (1999), pp. 219-229.
[5] Christopher Olston, Marc Najork, Web Crawling: http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf
[6] 29% Of Sites Face Duplicate Content Issues: https://tinyurl.com/y6tmh55y
[7] Rabin M.O., et al., "Fingerprinting by random polynomials," Center for Research in Computing Techn., Aiken Computation Laboratory, Univ. (1981).
[8] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, pp. 422–426, 1970.
[9] Donald J. Patterson, Web Crawling: https://www.ics.uci.edu/~lopes/teaching/cs221W12/slides/Lecture05.pdf
[10] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical Report, Stanford University, 1998.
[11] Burton Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, 13(7), pages 422–426, July 1970.
[12] Google Dynamic Rendering: https://developers.google.com/search/docs/guides/dynamic-rendering
[13] T. Urvoy, T. Lavergne, and P. Filoche, "Tracking web spam with hidden style similarity," in Proceedings of the 2nd International Workshop on Adversarial Information Retrieval on the Web, 2006.
[14] H.-T. Lee, D. Leonard, X. Wang, and D. Loguinov, "IRLbot: Scaling to 6 billion pages and beyond," in Proceedings of the 17th International World Wide Web Conference, 2008.
A notification system has become a very popular feature for many applications in recent years. A notification alerts a user with important information like breaking news, product updates, events, offerings, etc. It has become an indispensable part of our daily life. In this chapter, you are asked to design a notification system.
A notification is more than just a mobile push notification. Three types of notification formats are: mobile push notification, SMS message, and email. Figure 10-1 shows an example of each of these notifications.
Building a scalable system that sends out millions of notifications a day is not an easy task. It requires a deep understanding of the notification ecosystem. The interview question is purposely designed to be open-ended and ambiguous, and it is your responsibility to ask questions to clarify the requirements.
Candidate: What types of notifications does the system support?
Interviewer: Push notification, SMS message, and email.
Candidate: Is it a real-time system?
Interviewer: Let us say it is a soft real-time system. We want a user to receive notifications as soon as possible. However, if the system is under a high workload, a slight delay is acceptable.
Candidate: What are the supported devices?
Interviewer: iOS devices, Android devices, and laptop/desktop.
Candidate: What triggers notifications?
Interviewer: Notifications can be triggered by client applications. They can also be scheduled on the server side.
Candidate: Will users be able to opt out?
Interviewer: Yes, users who choose to opt out will no longer receive notifications.
Candidate: How many notifications are sent out each day?
Interviewer: 10 million mobile push notifications, 1 million SMS messages, and 5 million emails.
This section shows the high-level design that supports various notification types: iOS push notification, Android push notification, SMS message, and email. It is structured as follows:
•Different types of notifications
•Contact info gathering flow
•Notification sending/receiving flow
We start by looking at how each notification type works at a high level.
We primarily need three components to send an iOS push notification:
•Provider. A provider builds and sends notification requests to Apple Push Notification Service (APNS). To construct a push notification, the provider provides the following data:
•Device token: This is a unique identifier used for sending push notifications.
•Payload: This is a JSON dictionary that contains a notification's payload.
•APNS: This is a remote service provided by Apple to propagate push notifications to iOS devices.
•iOS Device: It is the end client, which receives push notifications.
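The payload mentioned above is a JSON dictionary; a minimal illustrative example (the alert text and badge count are made up):

```json
{
  "aps": {
    "alert": {
      "title": "Game Request",
      "body": "Bob wants to play chess"
    },
    "badge": 5
  }
}
```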
Android adopts a similar notification flow. Instead of using APNS, Firebase Cloud Messaging (FCM) is commonly used to send push notifications to Android devices.
For SMS messages, third-party SMS services like Twilio [1], Nexmo [2], and many others are commonly used. Most of them are commercial services.
Although companies can set up their own email servers, many of them opt for commercial email services. Sendgrid [3] and Mailchimp [4] are among the most popular email services, which offer a better delivery rate and data analytics.
Figure 10-6 shows the design after including all the third-party services.
To send notifications, we need to gather mobile device tokens, phone numbers, or email addresses. As shown in Figure 10-7, when a user installs our app or signs up for the first time, API servers collect user contact info and store it in the database.
Figure 10-8 shows simplified database tables to store contact info. Email addresses and phone numbers are stored in the user table, whereas device tokens are stored in the device table. A user can have multiple devices, indicating that a push notification can be sent to all the user's devices.
We will first present the initial design; then, we will propose some optimizations.
High-level design
Figure 10-9 shows the design, and each system component is explained below.
Service 1 to N: A service can be a microservice, a cron job, or a distributed system that triggers notification sending events. For example, a billing service sends emails to remind customers of their due payment, or a shopping website tells customers via SMS messages that their packages will be delivered tomorrow.
Notification system: The notification system is the centerpiece of sending/receiving notifications. Starting with something simple, only one notification server is used. It provides APIs for services 1 to N and builds notification payloads for third-party services.
Third-party services: Third-party services are responsible for delivering notifications to users. While integrating with third-party services, we need to pay extra attention to extensibility. Good extensibility means a flexible system that allows easily plugging or unplugging a third-party service. Another important consideration is that a third-party service might be unavailable in new markets or in the future. For instance, FCM is unavailable in China. Thus, alternative third-party services such as Jpush, PushY, etc., are used there.
iOS, Android, SMS, Email: Users receive notifications on their devices.
Three problems are identified in this design:
•Single point of failure (SPOF): A single notification server means a SPOF.
•Hard to scale: The notification system handles everything related to push notifications in one server. It is challenging to scale databases, caches, and different notification processing components independently.
•Performance bottleneck: Processing and sending notifications can be resource-intensive. For example, constructing HTML pages and waiting for responses from third-party services could take time. Handling everything in one system can result in system overload, especially during peak hours.
High-level design (improved)
After enumerating the challenges in the initial design, we improve the design as listed below:
•Move the database and cache out of the notification server.
•Add more notification servers and set up automatic horizontal scaling.
•Introduce message queues to decouple the system components.
Figure 10-10 shows the improved high-level design.
The best way to go through the above diagram is from left to right:
Service 1 to N: They represent different services that send notifications via APIs provided by notification servers.
Notification servers: They provide the following functionalities:
•Provide APIs for services to send notifications. Those APIs are only accessible internally or by verified clients to prevent spam.
•Carry out basic validations to verify emails, phone numbers, etc.
•Query the database or cache to fetch the data needed to render a notification.
•Put notification data into message queues for parallel processing.
Here is an example of the API to send an SMS message:
POST https://api.example.com/v/sms/send
Request body
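An illustrative request body (the field names are assumptions; the exact schema is not specified here):

```json
{
  "expiry_seconds": 60,
  "sender": "+10001112222",
  "recipients": ["+10002223333"],
  "message": "Your package is scheduled for delivery tomorrow."
}
```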
Cache: User info, device info, and notification templates are cached.
DB: It stores data about users, notifications, settings, etc.
Message queues: They remove dependencies between components. Message queues serve as buffers when high volumes of notifications are to be sent out. Each notification type is assigned a distinct message queue, so an outage in one third-party service will not affect other notification types.
Workers: Workers are a list of servers that pull notification events from message queues and send them to the corresponding third-party services.
Third-party services: Already explained in the initial design.
iOS, Android, SMS, Email: Already explained in the initial design.
Next, let us examine how every component works together to send a notification:
1. A service calls APIs provided by notification servers to send notifications.
2. Notification servers fetch metadata such as user info, device token, and notification settings from the cache or database.
3. A notification event is sent to the corresponding queue for processing. For instance, an iOS push notification event is sent to the iOS PN queue.
4. Workers pull notification events from message queues.
5. Workers send notifications to third-party services.
In the high-level design, we discussed different types of notifications, the contact info gathering flow, and the notification sending/receiving flow. We will explore the following in the deep dive:
•Reliability.
•Additional components and considerations: notification template, notification settings, rate limiting, retry mechanism, security in push notifications, monitoring queued notifications, and event tracking.
•Updated design.
We must answer a few important reliability questions when designing a notification system in distributed environments.
How to prevent data loss?
One of the most important requirements of a notification system is that it cannot lose data. Notifications can usually be delayed or re-ordered, but never lost. To satisfy this requirement, the notification system persists notification data in a database and implements a retry mechanism. The notification log database is included for data persistence, as shown in Figure 10-11.
Will recipients receive a notification exactly once?
The short answer is no. Although a notification is delivered exactly once most of the time, the distributed nature could result in duplicate notifications. To reduce the occurrence of duplicates, we introduce a dedupe mechanism and handle each failure case carefully. Here is a simple dedupe logic:
When a notification event first arrives, we check if it has been seen before by checking the event ID. If it has been seen before, it is discarded. Otherwise, we send out the notification. Interested readers who wish to explore why we cannot have exactly-once delivery can refer to the reference material [5].
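This dedupe logic can be sketched as follows (a production system would keep seen IDs in a shared store with expiry, not a local set):

```python
class NotificationDeduper:
    """Drops events whose ID has been seen before, so a redelivered
    queue message does not produce a duplicate notification."""

    def __init__(self):
        self.seen_ids = set()

    def should_send(self, event_id: str) -> bool:
        if event_id in self.seen_ids:
            return False                  # duplicate: discard
        self.seen_ids.add(event_id)
        return True

deduper = NotificationDeduper()
first = deduper.should_send("event-123")
replay = deduper.should_send("event-123")  # e.g. redelivered by the queue
```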
We have discussed how to collect user contact info and how to send and receive a notification. A notification system is a lot more than that. Here we discuss additional components, including template reusing, notification settings, event tracking, system monitoring, rate limiting, etc.
A large notification system sends out millions of notifications per day, and many of these notifications follow a similar format. Notification templates are introduced to avoid building every notification from scratch. A notification template is a preformatted notification that you customize with parameters, styling, tracking links, etc., to create your unique notification. Here is an example template for push notifications.
BODY:
You dreamed of it. We dared it. [ITEM NAME] is back — only until [DATE].
CTA:
Order Now. Or, Save My [ITEM NAME]
The benefits of using notification templates include maintaining a consistent format, reducing the margin of error, and saving time.
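A sketch of how such a template is filled in, using Python's `string.Template` (the placeholder names are our own substitutes for [ITEM NAME] and [DATE]):

```python
from string import Template

# The BODY template above, with its bracketed parameters turned into
# $-style placeholders.
body_template = Template(
    "You dreamed of it. We dared it. $item is back — only until $date."
)

message = body_template.substitute(item="Air Jordan", date="Friday")
```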
Users generally receive way too many notifications daily, and they can easily feel overwhelmed. Thus, many websites and apps give users fine-grained control over notification settings. This information is stored in the notification setting table, with the following fields:
user_id bigInt
channel varchar # push notification, email or SMS
opt_in boolean # opt-in to receive notification
Before any notification is sent to a user, we first check that the user has opted in to receive this type of notification.
To avoid overwhelming users with too many notifications, we can limit the number of notifications a user can receive. This is important because receivers could turn off notifications completely if we send them too often.
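A per-user daily cap can be sketched like this (the counters here live in memory; a real system would keep them in a shared store such as Redis):

```python
import collections

class NotificationRateLimiter:
    """Caps the number of notifications a user can receive per day."""

    def __init__(self, max_per_day: int):
        self.max_per_day = max_per_day
        self.sent = collections.Counter()   # (user_id, day) -> count

    def allow(self, user_id: str, day: str) -> bool:
        key = (user_id, day)
        if self.sent[key] >= self.max_per_day:
            return False                    # cap reached: drop or defer
        self.sent[key] += 1
        return True

limiter = NotificationRateLimiter(max_per_day=2)
results = [limiter.allow("u1", "2023-06-01") for _ in range(3)]
```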
When a third-party service fails to send a notification, the notification will be added to the message queue for retrying. If the problem persists, an alert will be sent out to developers.
For iOS or Android apps, an appKey and appSecret are used to secure push notification APIs [6]. Only authenticated or verified clients are allowed to send push notifications using our APIs. Interested users can refer to the reference material [6].
A key metric to monitor is the total number of queued notifications. If the number is large, the notification events are not being processed fast enough by workers. To avoid delays in notification delivery, more workers are needed. Figure 10-12 (credit to [7]) shows an example of queued messages to be processed.
Figure 10-12
Notification metrics, such as open rate, click rate, and engagement, are important in understanding customer behaviors. The analytics service implements event tracking. Integration between the notification system and the analytics service is usually required. Figure 10-13 shows an example of events that might be tracked for analytics purposes.
Putting everything together, Figure 10-14 shows the updated notification system design.
在本次设计中,与之前的设计相比,增加了很多新的部件。
In this design, many new components are added in comparison with the previous design.
•The notification servers are equipped with two more critical features: authentication and rate limiting.
•We also add a retry mechanism to handle notification failures. If the system fails to send notifications, they are put back in the message queue and the workers retry a predefined number of times.
•Furthermore, notification templates provide a consistent and efficient notification creation process.
•Finally, monitoring and tracking systems are added for system health checks and future improvements.
Notifications are indispensable because they keep us posted on important information. A notification could be a push notification about your favorite movie on Netflix, an email about discounts on new products, or a message about your online shopping payment confirmation.
In this chapter, we described the design of a scalable notification system that supports multiple notification formats: push notification, SMS message, and email. We adopted message queues to decouple system components.
Besides the high-level design, we dug deep into more components and optimizations.
•Reliability: We proposed a robust retry mechanism to minimize the failure rate.
•Security: The appKey/appSecret pair is used to ensure only verified clients can send notifications.
•Tracking and monitoring: These are implemented at every stage of the notification flow to capture important stats.
•Respect user settings: Users may opt out of receiving notifications. Our system checks user settings before sending notifications.
•Rate limiting: Users will appreciate a frequency cap on the number of notifications they receive.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Twilio SMS: https://www.twilio.com/sms
[2] Nexmo SMS: https://www.nexmo.com/products/sms
[3] Sendgrid: https://sendgrid.com/
[4] Mailchimp: https://mailchimp.com/
[5] You Cannot Have Exactly-Once Delivery: https://bravenewgeek.com/you-cannot-have-exactly-once-delivery/
[6] Security in Push Notifications: https://cloud.ibm.com/docs/services/mobilepush?topic=mobile-pushnotification-security-in-push-notifications
[7] RabbitMQ: https://bit.ly/2sotIa6
In this chapter, you are asked to design a news feed system. What is a news feed? According to the Facebook help page, “News feed is the constantly updating list of stories in the middle of your home page. News Feed includes status updates, photos, videos, links, app activity, and likes from people, pages, and groups that you follow on Facebook” [1]. This is a popular interview question. Similar questions commonly asked are: design Facebook news feed, Instagram feed, Twitter timeline, etc.
The first set of clarification questions is to understand what the interviewer has in mind when she asks you to design a news feed system. At the very least, you should figure out what features to support. Here is an example of candidate-interviewer interaction:
Candidate: Is this a mobile app? Or a web app? Or both?
Interviewer: Both
Candidate: What are the important features?
Interviewer: A user can publish a post and see her friends’ posts on the news feed page.
Candidate: Is the news feed sorted in reverse chronological order or by a particular order such as topic scores? For instance, posts from your close friends have higher scores.
Interviewer: To keep things simple, let us assume the feed is sorted in reverse chronological order.
Candidate: How many friends can a user have?
Interviewer: 5000
Candidate: What is the traffic volume?
Interviewer: 10 million DAU
Candidate: Can the feed contain images, videos, or just text?
Interviewer: It can contain media files, including both images and videos.
Now that you have gathered the requirements, we focus on designing the system.
The design is divided into two flows: feed publishing and news feed building.
•Feed publishing: when a user publishes a post, the corresponding data is written into the cache and database. The post is populated to her friends’ news feeds.
•News feed building: for simplicity, let us assume the news feed is built by aggregating friends’ posts in reverse chronological order.
The news feed APIs are the primary ways for clients to communicate with servers. Those APIs are HTTP based and allow clients to perform actions such as posting a status, retrieving the news feed, adding friends, etc. We discuss the two most important APIs: the feed publishing API and the news feed retrieval API.
Feed publishing API
To publish a post, an HTTP POST request is sent to the server. The API is shown below:
POST /v1/me/feed
Params:
•content: the text of the post.
•auth_token: used to authenticate API requests.
News feed retrieval API
The API to retrieve the news feed is shown below:
GET /v1/me/feed
Params:
•auth_token: used to authenticate API requests.
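The two APIs can be simulated with a small in-process sketch. Only the endpoint paths and the `content`/`auth_token` parameters come from the text; the token table, response shapes, and status codes are assumptions for illustration.

```python
# In-process sketch of POST /v1/me/feed and GET /v1/me/feed.
VALID_TOKENS = {"token-abc": 42}   # assumed auth_token -> user_id mapping

feeds = {42: []}                   # user_id -> post contents, newest first

def publish_feed(auth_token, content):
    """POST /v1/me/feed: publish a post under the authenticated user."""
    user_id = VALID_TOKENS.get(auth_token)
    if user_id is None:
        return {"status": 401}     # reject unauthenticated requests
    feeds[user_id].insert(0, content)   # keep reverse chronological order
    return {"status": 201}

def retrieve_feed(auth_token):
    """GET /v1/me/feed: return the authenticated user's news feed."""
    user_id = VALID_TOKENS.get(auth_token)
    if user_id is None:
        return {"status": 401}
    return {"status": 200, "feed": feeds[user_id]}
```

Both endpoints resolve the user from auth_token rather than trusting a user ID in the request, which is why authentication belongs at the web-server layer.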
Figure 11-2 shows the high-level design of the feed publishing flow.
•User: a user can view news feeds on a browser or mobile app. A user makes a post with content “Hello” through the API:
/v1/me/feed?content=Hello&auth_token={auth_token}
•Load balancer: distributes traffic to web servers.
•Web servers: redirect traffic to different internal services.
•Post service: persists posts in the database and cache.
•Fanout service: pushes new content to friends’ news feeds. News feed data is stored in the cache for fast retrieval.
•Notification service: informs friends that new content is available and sends out push notifications.
In this section, we discuss how the news feed is built behind the scenes. Figure 11-3 shows the high-level design:
•User: a user sends a request to retrieve her news feed. The request looks like this: /v1/me/feed.
•Load balancer: redirects traffic to web servers.
•Web servers: route requests to the news feed service.
•News feed service: fetches the news feed from the cache.
•News feed cache: stores the news feed IDs needed to render the news feed.
The high-level design briefly covered two flows: feed publishing and news feed building. Here, we discuss those topics in more depth.
Figure 11-4 outlines the detailed design for feed publishing. We have discussed most of the components in the high-level design, and we will focus on two of them: web servers and the fanout service.
Besides communicating with clients, web servers enforce authentication and rate limiting. Only users signed in with a valid auth_token are allowed to make posts. The system limits the number of posts a user can make within a certain period, which is vital to prevent spam and abusive content.
Fanout is the process of delivering a post to all friends. The two fanout models are: fanout on write (also called the push model) and fanout on read (also called the pull model). Both models have pros and cons. We explain their workflows and explore the best approach to support our system.
Fanout on write. With this approach, the news feed is pre-computed during write time. A new post is delivered to friends’ caches immediately after it is published.
Pros:
•The news feed is generated in real time and can be pushed to friends immediately.
•Fetching the news feed is fast because it is pre-computed during write time.
Cons:
•If a user has many friends, fetching the friend list and generating news feeds for all of them is slow and time-consuming. This is called the hotkey problem.
•For inactive users or those who rarely log in, pre-computing news feeds wastes computing resources.
Fanout on read. The news feed is generated during read time. This is an on-demand model. Recent posts are pulled when a user loads her home page.
Pros:
•For inactive users or those who rarely log in, fanout on read works better because it does not waste computing resources on them.
•Data is not pushed to friends, so there is no hotkey problem.
Cons:
•Fetching the news feed is slow as it is not pre-computed.
We adopt a hybrid approach to get the benefits of both approaches and avoid their pitfalls. Since fetching the news feed fast is crucial, we use the push model for the majority of users. For celebrities or users who have many friends/followers, we let followers pull news content on demand to avoid system overload. Consistent hashing is a useful technique to mitigate the hotkey problem as it helps to distribute requests/data more evenly.
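The hybrid decision boils down to a follower-count check at publish time. The threshold below is a made-up number for illustration; any real system would tune it empirically.

```python
CELEBRITY_THRESHOLD = 10_000   # assumed cutoff for "many followers"

def fanout_model(follower_count):
    """Hybrid approach: pre-compute feeds (push / fanout on write) for most
    users; let followers pull on demand for celebrity accounts to avoid
    fanning a single post out to millions of caches."""
    return "pull" if follower_count > CELEBRITY_THRESHOLD else "push"
```

At read time, a follower's feed is then the merge of her pushed cache entries and any on-demand pulls from celebrity accounts she follows.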
Let us take a closer look at the fanout service as shown in Figure 11-5.
The fanout service works as follows:
1. Fetch friend IDs from the graph database. Graph databases are suited for managing friend relationships and friend recommendations. Readers interested in learning more about this concept should refer to the reference material [2].
2. Get friends’ info from the user cache. The system then filters out friends based on user settings. For example, if you mute someone, her posts will not show up in your news feed even though you are still friends. Another reason why posts may not show is that a user could selectively share information with specific friends or hide it from other people.
3. Send the friend list and the new post ID to the message queue.
4. Fanout workers fetch data from the message queue and store news feed data in the news feed cache. You can think of the news feed cache as a <post_id, user_id> mapping table. Whenever a new post is made, it is appended to the news feed table as shown in Figure 11-6. Memory consumption can become very large if we store entire user and post objects in the cache. Thus, only IDs are stored. To keep the memory size small, we set a configurable limit. The chance of a user scrolling through thousands of posts in the news feed is slim. Most users are only interested in the latest content, so the cache miss rate is low.
5. Store <post_id, user_id> in the news feed cache. Figure 11-6 shows an example of what the news feed looks like in cache.
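The fanout worker plus the ID-only, capped feed cache can be sketched as follows. A bounded deque per user stands in for the cache; the 200-entry limit is an assumed value for the configurable cap mentioned in step 4.

```python
from collections import deque

FEED_CACHE_LIMIT = 200   # assumed configurable cap on cached post IDs per user

news_feed_cache = {}     # user_id -> deque of post IDs, newest first

def fanout(post_id, friend_ids):
    """Fanout worker: prepend the new post ID to each friend's cached feed.
    Only IDs are stored, and the bounded deque evicts the oldest entries
    beyond the configurable limit."""
    for uid in friend_ids:
        feed = news_feed_cache.setdefault(uid, deque(maxlen=FEED_CACHE_LIMIT))
        feed.appendleft(post_id)
```

Because `maxlen` drops the oldest IDs automatically, memory per user stays bounded no matter how many posts are fanned out.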
Figure 11-7 illustrates the detailed design for news feed retrieval.
As shown in Figure 11-7, media content (images, videos, etc.) is stored in a CDN for fast retrieval. Let us look at how a client retrieves the news feed.
1. A user sends a request to retrieve her news feed. The request looks like this: /v1/me/feed
2. The load balancer redistributes requests to web servers.
3. Web servers call the news feed service to fetch news feeds.
4. The news feed service gets a list of post IDs from the news feed cache.
5. A user’s news feed is more than just a list of feed IDs. It contains usernames, profile pictures, post content, post images, etc. Thus, the news feed service fetches the complete user and post objects from caches (user cache and post cache) to construct the fully hydrated news feed.
6. The fully hydrated news feed is returned in JSON format to the client for rendering.
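The hydration step (joining cached IDs against the user and post caches) can be sketched as below. The dicts stand in for the user cache and post cache; field names are illustrative.

```python
# Hypothetical cache contents for illustration.
user_cache = {1: {"username": "alice", "profile_pic": "alice.png"}}
post_cache = {10: {"author_id": 1, "content": "Hello"}}

def hydrate(post_ids):
    """Build the fully hydrated feed: join each post ID against the post
    cache, then join its author against the user cache."""
    feed = []
    for pid in post_ids:
        post = post_cache[pid]
        author = user_cache[post["author_id"]]
        feed.append({
            "post_id": pid,
            "username": author["username"],
            "profile_pic": author["profile_pic"],
            "content": post["content"],
        })
    return feed   # serialized to JSON before returning to the client
```

A production version would batch-fetch all IDs in one round trip to each cache instead of looking them up one by one.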
Caching is extremely important for a news feed system. We divide the cache tier into 5 layers as shown in Figure 11-8.
•News Feed: stores the IDs of news feeds.
•Content: stores every post’s data. Popular content is stored in the hot cache.
•Social Graph: stores user relationship data.
•Action: stores info about whether a user liked a post, replied to a post, or took other actions on a post.
•Counters: store counters for likes, replies, followers, following, etc.
In this chapter, we designed a news feed system. Our design contains two flows: feed publishing and news feed retrieval.
Like any system design interview question, there is no perfect way to design a system. Every company has its unique constraints, and you must design a system to fit those constraints. Understanding the tradeoffs of your design and technology choices is important. If there are a few minutes left, you can talk about scalability issues. To avoid duplicated discussion, only high-level talking points are listed below.
Scaling the database:
•Vertical scaling vs horizontal scaling
•SQL vs NoSQL
•Master-slave replication
•Read replicas
•Consistency models
•Database sharding
Other talking points:
•Keep the web tier stateless
•Cache data as much as you can
•Support multiple data centers
•Loosely couple components with message queues
•Monitor key metrics. For instance, QPS during peak hours and latency while users refresh their news feed are interesting to monitor.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] How News Feed Works: https://www.facebook.com/help/327131014036297/
[2] Friend of Friend recommendations with Neo4j and SQL Server:
In this chapter, we explore the design of a chat system. Almost everyone uses a chat app. Figure 12-1 shows some of the most popular apps in the marketplace.
A chat app performs different functions for different people. It is extremely important to nail down the exact requirements. For example, you do not want to design a system that focuses on group chat when the interviewer has one-on-one chat in mind. It is important to explore the feature requirements.
It is vital to agree on the type of chat app to design. In the marketplace, there are one-on-one chat apps like Facebook Messenger, WeChat, and WhatsApp; office chat apps that focus on group chat, like Slack; and game chat apps, like Discord, that focus on large group interaction and low voice chat latency.
The first set of clarification questions should nail down exactly what the interviewer has in mind when she asks you to design a chat system. At the very least, figure out whether you should focus on a one-on-one chat or a group chat app. Some questions you might ask are as follows:
Candidate: What kind of chat app shall we design? 1 on 1 or group based?
Interviewer: It should support both 1 on 1 and group chat.
Candidate: Is this a mobile app? Or a web app? Or both?
Interviewer: Both.
Candidate: What is the scale of this app? A startup app or massive scale?
Interviewer: It should support 50 million daily active users (DAU).
Candidate: For group chat, what is the group member limit?
Interviewer: A maximum of 100 people.
Candidate: What features are important for the chat app? Can it support attachments?
Interviewer: 1 on 1 chat, group chat, online indicator. The system only supports text messages.
Candidate: Is there a message size limit?
Interviewer: Yes, text length should be less than 100,000 characters long.
Candidate: Is end-to-end encryption required?
Interviewer: Not required for now, but we will discuss that if time allows.
Candidate: How long shall we store the chat history?
Interviewer: Forever.
In this chapter, we focus on designing a chat app like Facebook Messenger, with an emphasis on the following features:
•A one-on-one chat with low delivery latency
•Small group chat (max of 100 people)
•Online presence
•Multiple device support. The same account can be logged in on multiple devices at the same time.
•Push notifications
It is also important to agree on the design scale. We will design a system that supports 50 million DAU.
To develop a high-quality design, we should have a basic knowledge of how clients and servers communicate. In a chat system, clients can be either mobile applications or web applications. Clients do not communicate directly with each other. Instead, each client connects to a chat service, which supports all the features mentioned above. Let us focus on fundamental operations. The chat service must support the following functions:
•Receive messages from other clients.
•Find the right recipients for each message and relay the message to the recipients.
•If a recipient is not online, hold the messages for that recipient on the server until she is online.
Figure 12-2 shows the relationships between clients (sender and receiver) and the chat service.
When a client intends to start a chat, it connects to the chat service using one or more network protocols. For a chat service, the choice of network protocols is important. Let us discuss this with the interviewer.
Requests are initiated by the client for most client/server applications. This is also true for the sender side of a chat application. In Figure 12-2, when the sender sends a message to the receiver via the chat service, it uses the time-tested HTTP protocol, which is the most common web protocol. In this scenario, the client opens an HTTP connection with the chat service and sends the message, informing the service to send the message to the receiver. Keep-alive is efficient here because the keep-alive header allows a client to maintain a persistent connection with the chat service. It also reduces the number of TCP handshakes. HTTP is a fine option on the sender side, and many popular chat applications such as Facebook [1] used HTTP initially to send messages.
However, the receiver side is a bit more complicated. Since HTTP is client-initiated, it is not trivial to send messages from the server. Over the years, many techniques have been used to simulate a server-initiated connection: polling, long polling, and WebSocket. These are important techniques widely used in system design interviews, so let us examine each of them.
As shown in Figure 12-3, polling is a technique in which the client periodically asks the server if there are messages available. Depending on the polling frequency, polling could be costly. It could consume precious server resources to answer a question that offers “no” as an answer most of the time.
Because polling could be inefficient, the next progression is long polling (Figure 12-4).
In long polling, a client holds the connection open until there are actually new messages available or a timeout threshold is reached. Once the client receives new messages, it immediately sends another request to the server, restarting the process. Long polling has a few drawbacks:
•Sender and receiver may not connect to the same chat server. HTTP-based servers are usually stateless. If you use round robin for load balancing, the server that receives the message might not have a long-polling connection with the client who receives the message.
•A server has no good way to tell if a client is disconnected.
•It is inefficient. If a user does not chat much, long polling still makes periodic connections after timeouts.
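The server side of long polling is essentially a blocking wait with a timeout. A minimal sketch, using an in-process queue as the user's mailbox; the timeout value and names are assumptions for illustration.

```python
import queue

POLL_TIMEOUT = 30.0   # assumed seconds the server holds the request open

def long_poll(inbox, timeout=POLL_TIMEOUT):
    """Server side of long polling: block until a message arrives or the
    timeout elapses, then answer. The client immediately re-requests in
    either case, which is the periodic-connection overhead noted above."""
    try:
        return [inbox.get(timeout=timeout)]
    except queue.Empty:
        return []   # timeout expired: an empty answer, a wasted round trip
```

The empty-answer path is exactly the inefficiency described in the last bullet: quiet users still pay one round trip per timeout window.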
WebSocket is the most common solution for sending asynchronous updates from server to client. Figure 12-5 shows how it works.
A WebSocket connection is initiated by the client. It is bi-directional and persistent. It starts its life as an HTTP connection and can be “upgraded” via a well-defined handshake to a WebSocket connection. Through this persistent connection, a server can send updates to a client. WebSocket connections generally work even if a firewall is in place. This is because they use port 80 or 443, which are also used by HTTP/HTTPS connections.
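The "well-defined handshake" is specified in RFC 6455: the server proves it understood the upgrade request by hashing the client's Sec-WebSocket-Key with a fixed GUID and echoing the result back in its 101 Switching Protocols response.

```python
import base64
import hashlib

WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed GUID from RFC 6455

def websocket_accept(sec_websocket_key):
    """Compute the Sec-WebSocket-Accept header the server returns in its
    '101 Switching Protocols' response during the HTTP upgrade handshake."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode()).digest()
    return base64.b64encode(digest).decode()
```

After this exchange, the same TCP connection carries WebSocket frames in both directions instead of HTTP requests and responses.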
Earlier we said that on the sender side HTTP is a fine protocol to use, but since WebSocket is bi-directional, there is no strong technical reason not to use it for receiving as well. Figure 12-6 shows how WebSocket (ws) is used for both the sender and receiver sides.
Using WebSocket for both sending and receiving simplifies the design and makes implementation on both the client and the server more straightforward. Since WebSocket connections are persistent, efficient connection management is critical on the server side.
Although we chose WebSocket as the main communication protocol between the client and server for its bidirectional communication, it is important to note that everything else does not have to be WebSocket. In fact, most features (sign up, login, user profile, etc.) of a chat application could use the traditional request/response method over HTTP. Let us drill in a bit and look at the high-level components of the system.
As shown in Figure 12-7, the chat system is broken down into three major categories: stateless services, stateful services, and third-party integration.
Stateless services are traditional public-facing request/response services, used to manage the login, signup, user profile, etc. These are common features among many websites and apps.
Stateless services sit behind a load balancer whose job is to route requests to the correct services based on the request paths. These services can be monolithic or individual microservices. We do not need to build many of these stateless services ourselves, as there are services in the market that can be integrated easily. The one service we will discuss more in the deep dive is service discovery. Its primary job is to give the client a list of DNS host names of chat servers that the client could connect to.
The only stateful service is the chat service. The service is stateful because each client maintains a persistent network connection to a chat server. In this service, a client normally does not switch to another chat server as long as the server is still available. Service discovery coordinates closely with the chat service to avoid server overloading. We will go into detail in the deep dive.
For a chat app, push notification is the most important third-party integration. It is a way to inform users when new messages have arrived, even when the app is not running. Proper integration of push notification is crucial. Refer to Chapter 10, Design a Notification System, for more information.
On a small scale, all the services listed above could fit on one server. Even at the scale we design for, it is in theory possible to fit all user connections on one modern cloud server. The number of concurrent connections that a server can handle will most likely be the limiting factor. In our scenario, at 1M concurrent users, assuming each user connection needs 10KB of memory on the server (this is a very rough figure and very dependent on the language choice), it only needs about 10GB of memory to hold all the connections on one box.
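A quick back-of-the-envelope check of that estimate:

```python
# 1M concurrent connections at ~10KB each -> roughly 10GB on one box.
concurrent_users = 1_000_000
bytes_per_connection = 10 * 1024                 # ~10KB, a very rough figure
total_gib = concurrent_users * bytes_per_connection / 2**30
print(f"{total_gib:.1f} GiB")                    # about 10GB of memory
```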
If we propose a design where everything fits on one server, this may raise a big red flag in the interviewer’s mind. No technologist would design at such a scale on a single server. A single-server design is a deal breaker due to many factors; the single point of failure is the biggest among them.
However, it is perfectly fine to start with a single-server design. Just make sure the interviewer knows this is a starting point. Putting everything we mentioned together, Figure 12-8 shows the adjusted high-level design.
In Figure 12-8, the client maintains a persistent WebSocket connection to a chat server for real-time messaging.
•Chat servers facilitate message sending/receiving.
•Presence servers manage online/offline status.
•API servers handle everything else, including user login, signup, profile changes, etc.
•Notification servers send push notifications.
•Finally, the key-value store is used to store chat history. When an offline user comes online, she will see all her previous chat history.
At this point, we have servers ready, services up and running, and third-party integrations complete. Deep down the technical stack is the data layer. The data layer usually requires some effort to get right. An important decision we must make is the right type of database to use: relational databases or NoSQL databases? To make an informed decision, we will examine the data types and read/write patterns.
Two types of data exist in a typical chat system. The first is generic data, such as user profiles, settings, and user friend lists. These data are stored in robust and reliable relational databases. Replication and sharding are common techniques to satisfy availability and scalability requirements.
The second is unique to chat systems: chat history data. It is important to understand the read/write pattern.
•The amount of data is enormous for chat systems. A previous study [2] reveals that Facebook Messenger and WhatsApp process 60 billion messages a day.
•Only recent chats are accessed frequently. Users do not usually look up old chats.
•Although very recent chat history is viewed in most cases, users might use features that require random access of data, such as search, viewing your mentions, jumping to specific messages, etc. These cases should be supported by the data access layer.
•The read to write ratio is about 1:1 for 1 on 1 chat apps.
Selecting the correct storage system that supports all of our use cases is crucial. We recommend key-value stores for the following reasons:
•Key-value stores allow easy horizontal scaling.
•Key-value stores provide very low latency to access data.
•Relational databases do not handle the long tail [3] of data well. When the indexes grow large, random access is expensive.
•Key-value stores are adopted by other proven, reliable chat applications. For example, both Facebook Messenger and Discord use key-value stores. Facebook Messenger uses HBase [4], and Discord uses Cassandra [5].
We just discussed using key-value stores as our storage layer. The most important data is message data. Let us take a close look.
Figure 12-9 shows the message table for 1-on-1 chat. The primary key is message_id, which helps to decide the message sequence. We cannot rely on created_at to decide the message sequence because two messages can be created at the same time.
Figure 12-10 shows the message table for group chat. The composite primary key is (channel_id, message_id). Channel and group carry the same meaning here. channel_id is the partition key because all queries in a group chat operate in a channel.
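The two tables can be sketched with sqlite3 standing in for the key-value store; only the keys are specified in the text, so the non-key columns (sender, recipient, content) are assumed for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# 1-on-1 chat (Figure 12-9): message_id alone decides the sequence.
conn.execute("""
CREATE TABLE message (
    message_id   INTEGER PRIMARY KEY,  -- decides message sequence
    message_from INTEGER,
    message_to   INTEGER,
    content      TEXT,
    created_at   TIMESTAMP             -- not reliable for ordering
)""")

# Group chat (Figure 12-10): composite key (channel_id, message_id);
# channel_id is the partition key since all queries operate in a channel.
conn.execute("""
CREATE TABLE group_message (
    channel_id INTEGER,
    message_id INTEGER,
    user_id    INTEGER,
    content    TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (channel_id, message_id)
)""")

conn.execute("INSERT INTO group_message VALUES (7, 1, 42, 'hi', NULL)")
conn.execute("INSERT INTO group_message VALUES (7, 2, 43, 'hello', NULL)")
# A typical group-chat query: all messages of one channel, in order.
rows = conn.execute(
    "SELECT message_id FROM group_message WHERE channel_id = 7 "
    "ORDER BY message_id").fetchall()
```

Reading one channel touches a single partition and returns rows already ordered by message_id, which is exactly the access pattern the composite key is designed for.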
How to generate message_id is an interesting topic worth exploring. message_id carries the responsibility of ensuring the order of messages. To ascertain the order of messages, message_id must satisfy the following two requirements:
•IDs must be unique.
•IDs should be sortable by time, meaning new rows have higher IDs than old ones.
How can we achieve those two guarantees? The first idea that comes to mind is the “auto_increment” keyword in MySQL. However, NoSQL databases usually do not provide such a feature.
The second approach is to use a global 64-bit sequence number generator like Snowflake [6]. This is discussed in “Chapter 7: Design a Unique ID Generator in Distributed Systems”.
The final approach is to use a local sequence number generator. Local means IDs are only unique within a group. The reason why local IDs work is that maintaining the message sequence within a one-on-one channel or a group channel is sufficient. This approach is easier to implement than the global ID implementation.
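A minimal sketch of a local sequence number generator, assuming an in-process counter per channel; a production version would keep the counters in shared storage with atomic increments.

```python
from collections import defaultdict
from itertools import count

class LocalIdGenerator:
    """Per-channel sequence numbers: unique and increasing within a
    channel, but two different channels may reuse the same IDs."""
    def __init__(self):
        self._counters = defaultdict(lambda: count(1))

    def next_id(self, channel_id):
        return next(self._counters[channel_id])

gen = LocalIdGenerator()
ids_a = [gen.next_id("channel-A") for _ in range(3)]  # 1, 2, 3
first_b = gen.next_id("channel-B")                    # also 1: local, not global
```

The overlap between channels is acceptable precisely because ordering only ever matters inside one channel.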
In a system design interview, you are usually expected to dive deep into some of the components in the high-level design. For the chat system, service discovery, messaging flows, and online/offline indicators are worth deeper exploration.
The primary role of service discovery is to recommend the best chat server for a client based on criteria like geographical location, server capacity, etc. Apache ZooKeeper [7] is a popular open-source solution for service discovery. It registers all the available chat servers and picks the best chat server for a client based on predefined criteria.
Figure 12-11 shows how service discovery (ZooKeeper) works.
1. User A tries to log in to the app.
2. The load balancer sends the login request to API servers.
3. After the backend authenticates the user, service discovery finds the best chat server for User A. In this example, server 2 is chosen, and the server info is returned to User A.
4. User A connects to chat server 2 through WebSocket.
It is interesting to understand the end-to-end flow of a chat system. In this section, we will explore the 1-on-1 chat flow, message synchronization across multiple devices, and the group chat flow.
Figure 12-12 explains what happens when User A sends a message to User B.
1. User A sends a chat message to Chat server 1.
2. Chat server 1 obtains a message ID from the ID generator.
3. Chat server 1 sends the message to the message sync queue.
4. The message is stored in a key-value store.
5.a. If User B is online, the message is forwarded to Chat server 2, where User B is connected.
5.b. If User B is offline, a push notification is sent from push notification (PN) servers.
6. Chat server 2 forwards the message to User B. There is a persistent WebSocket connection between User B and Chat server 2.
Many users have multiple devices. We will explain how to sync messages across multiple devices. Figure 12-13 shows an example of message synchronization.
In Figure 12-13, User A has two devices: a phone and a laptop. When User A logs in to the chat app with her phone, it establishes a WebSocket connection with Chat server 1. Similarly, there is a connection between the laptop and Chat server 1.
Each device maintains a variable called cur_max_message_id, which keeps track of the latest message ID on the device. Messages that satisfy the following two conditions are considered new messages:
•The recipient ID is equal to the currently logged-in user ID.
•The message ID in the key-value store is larger than cur_max_message_id.
With a distinct cur_max_message_id on each device, message synchronization is easy, as each device can get new messages from the KV store.
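The two "new message" conditions above can be sketched as a filter each device runs against the KV store; the message layout (`to` and `message_id` fields) is assumed for illustration.

```python
def fetch_new_messages(kv_messages, user_id, cur_max_message_id):
    """Return messages that (1) are addressed to the logged-in user and
    (2) are newer than the device's cur_max_message_id, in ID order."""
    new = [m for m in kv_messages
           if m["to"] == user_id and m["message_id"] > cur_max_message_id]
    return sorted(new, key=lambda m: m["message_id"])

store = [
    {"message_id": 1, "to": "A", "content": "hi"},
    {"message_id": 2, "to": "B", "content": "yo"},
    {"message_id": 3, "to": "A", "content": "lunch?"},
]

# User A's phone last saw message 1; her laptop has seen nothing yet.
phone_new = fetch_new_messages(store, "A", cur_max_message_id=1)
laptop_new = fetch_new_messages(store, "A", cur_max_message_id=0)
```

Each device advances its own cur_max_message_id independently, which is why the same store can serve both the phone and the laptop without extra bookkeeping on the server.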
In comparison to 1-on-1 chat, the logic of group chat is more complicated. Figures 12-14 and 12-15 explain the flow.
Figure 12-14 explains what happens when User A sends a message in a group chat. Assume there are 3 members in the group (User A, User B, and User C). First, the message from User A is copied to each group member’s message sync queue: one for User B and a second for User C. You can think of the message sync queue as an inbox for a recipient. This design choice is good for small group chat because:
•It simplifies the message sync flow, as each client only needs to check its own inbox to get new messages.
•When the group is small, storing a copy in each recipient’s inbox is not too expensive.
WeChat uses a similar approach, and it limits a group to 500 members [8]. However, for groups with a lot of users, storing a message copy for each member is not acceptable.
On the recipient side, a recipient can receive messages from multiple users. Each recipient has an inbox (message sync queue) which contains messages from different senders. Figure 12-15 illustrates the design.
An online presence indicator is an essential feature of many chat applications. Usually, you can see a green dot next to a user’s profile picture or username. This section explains what happens behind the scenes.
In the high-level design, presence servers are responsible for managing online status and communicating with clients through WebSocket. There are a few flows that trigger online status changes. Let us examine each of them.
The user login flow is explained in the “Service Discovery” section. After a WebSocket connection is built between the client and the real-time service, User A’s online status and last_active_at timestamp are saved in the KV store. The presence indicator shows the user is online after she logs in.
When a user logs out, it goes through the user logout flow as shown in Figure 12-17. The online status is changed to offline in the KV store, and the presence indicator shows the user is offline.
We all wish our internet connection were consistent and reliable. However, that is not always the case; thus, we must address this issue in our design. When a user disconnects from the internet, the persistent connection between the client and server is lost. A naive way to handle user disconnection is to mark the user as offline and change the status to online when the connection re-establishes. However, this approach has a major flaw. It is common for users to disconnect and reconnect to the internet frequently within a short time. For example, network connections can be on and off while a user goes through a tunnel. Updating the online status on every disconnect/reconnect would make the presence indicator change too often, resulting in a poor user experience.
We introduce a heartbeat mechanism to solve this problem. Periodically, an online client sends a heartbeat event to presence servers. If presence servers receive a heartbeat event from the client within a certain time, say x seconds, the user is considered online. Otherwise, she is offline.
In Figure 12-18, the client sends a heartbeat event to the server every 5 seconds. After sending 3 heartbeat events, the client is disconnected and does not reconnect within x = 30 seconds (this number is arbitrarily chosen to demonstrate the logic). The online status is changed to offline.
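The heartbeat logic of Figure 12-18 can be sketched as follows. The clock is injected so the timeline is deterministic; the 5-second interval and x = 30 seconds match the figure.

```python
HEARTBEAT_TIMEOUT = 30  # seconds; the "x" in the text, arbitrarily chosen

class PresenceServer:
    def __init__(self, clock):
        self._clock = clock        # injectable time source for testing
        self._last_seen = {}       # user_id -> time of last heartbeat

    def heartbeat(self, user_id):
        self._last_seen[user_id] = self._clock()

    def is_online(self, user_id):
        last = self._last_seen.get(user_id)
        return last is not None and self._clock() - last < HEARTBEAT_TIMEOUT

now = [0]
server = PresenceServer(clock=lambda: now[0])

# The client sends heartbeats every 5 seconds, then disconnects.
for t in (0, 5, 10):
    now[0] = t
    server.heartbeat("user_a")

now[0] = 15
online_at_15 = server.is_online("user_a")   # 5s since the last beat: online
now[0] = 41
online_at_41 = server.is_online("user_a")   # 31s since the last beat: offline
```

Because status only flips after the timeout elapses, a brief tunnel dropout between two heartbeats never reaches the user's friends.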
How do User A’s friends know about the status changes? Figure 12-19 explains how it works. Presence servers use a publish-subscribe model, in which each friend pair maintains a channel. When User A’s online status changes, the event is published to three channels: A-B, A-C, and A-D. Those three channels are subscribed to by Users B, C, and D, respectively. Thus, it is easy for friends to get online status updates. The communication between clients and servers is through real-time WebSocket.
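A minimal in-memory sketch of the publish-subscribe model, assuming User A has three friends B, C, and D as in Figure 12-19; a real system would push over WebSocket rather than invoke local callbacks.

```python
from collections import defaultdict

class PresencePubSub:
    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> callbacks

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish_status(self, user, friends, status):
        # One channel per friend pair, e.g. "A-B", "A-C", "A-D".
        for friend in friends:
            for cb in self._subscribers[f"{user}-{friend}"]:
                cb(user, status)

bus = PresencePubSub()
received = []
for friend in ("B", "C", "D"):
    bus.subscribe(f"A-{friend}",
                  lambda user, status, f=friend: received.append((f, user, status)))

bus.publish_status("A", ["B", "C", "D"], "online")
```

One status change fans out to one event per friend, which is exactly why this model stops scaling for very large groups, as discussed next.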
The above design is effective for a small user group. For instance, WeChat uses a similar approach because its group size is capped at 500. For larger groups, informing all members about online status changes is expensive and time-consuming. Assume a group has 100,000 members. Each status change will generate 100,000 events. To solve this performance bottleneck, a possible solution is to fetch online status only when a user enters a group or manually refreshes the friend list.
In this chapter, we presented a chat system architecture that supports both 1-on-1 chat and small group chat. WebSocket is used for real-time communication between the client and server. The chat system contains the following components: chat servers for real-time messaging, presence servers for managing online presence, push notification servers for sending push notifications, key-value stores for chat history persistence, and API servers for other functionalities.
If you have extra time at the end of the interview, here are additional talking points:
•Extend the chat app to support media files such as photos and videos. Media files are significantly larger than text in size. Compression, cloud storage, and thumbnails are interesting topics to talk about.
•End-to-end encryption. WhatsApp supports end-to-end encryption for messages. Only the sender and the recipient can read messages. Interested readers can refer to the article in the reference materials [9].
•Caching messages on the client side is effective in reducing the data transfer between the client and server.
•Improve load time. Slack built a geographically distributed network to cache users’ data, channels, etc. for better load time [10].
•Error handling.
•Chat server error. There might be hundreds of thousands of persistent connections, or even more, to a chat server. If a chat server goes offline, service discovery (ZooKeeper) will provide a new chat server for clients to establish new connections with.
•Message resend mechanism. Retry and queueing are common techniques for resending messages.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Erlang at Facebook: https://www.erlang-factory.com/upload/presentations/31/EugeneLetuchy-ErlangatFacebook.pdf
[2] Messenger and WhatsApp process 60 billion messages a day: https://www.theverge.com/2016/4/12/11415198/facebook-messenger-whatsapp-number-messages-vs-sms-f8-2016
[3] Long tail: https://en.wikipedia.org/wiki/Long_tail
[4] The Underlying Technology of Messages: https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919/
[5] How Discord Stores Billions of Messages: https://blog.discordapp.com/how-discord-stores-billions-of-messages-7fa6ec7ee4c7
[6] Announcing Snowflake: https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html
[7] Apache ZooKeeper: https://zookeeper.apache.org/
[8] From nothing: the evolution of WeChat background system (article in Chinese): https://www.infoq.cn/article/the-road-of-the-growth-weixin-background
[9] End-to-end encryption: https://faq.whatsapp.com/en/android/28030015/
[10] Flannel: An Application-Level Edge Cache to Make Slack Scale: https://slack.engineering/flannel-an-application-level-edge-cache-to-make-slack-scale-b8a6400e2f6b
When searching on Google or shopping at Amazon, as you type in the search box, one or more matches for the search term are presented to you. This feature is referred to as autocomplete, typeahead, search-as-you-type, or incremental search. Figure 13-1 presents an example of a Google search showing a list of autocompleted results when “dinner” is typed into the search box. Search autocomplete is an important feature of many products. This leads us to the interview question: design a search autocomplete system, also called “design top k” or “design top k most searched queries”.
The first step to tackle any system design interview question is to ask enough questions to clarify the requirements. Here is an example of candidate-interviewer interaction:
Candidate: Is the matching only supported at the beginning of a search query or in the middle as well?
Interviewer: Only at the beginning of a search query.
Candidate: How many autocomplete suggestions should the system return?
Interviewer: 5
Candidate: How does the system know which 5 suggestions to return?
Interviewer: This is determined by popularity, decided by the historical query frequency.
Candidate: Does the system support spell check?
Interviewer: No, spell check or autocorrect is not supported.
Candidate: Are search queries in English?
Interviewer: Yes. If time allows at the end, we can discuss multi-language support.
Candidate: Do we allow capitalization and special characters?
Interviewer: No, we assume all search queries consist of lowercase alphabetic characters.
Candidate: How many users use the product?
Interviewer: 10 million DAU.
Requirements
Here is a summary of the requirements:
•Fast response time: As a user types a search query, autocomplete suggestions must show up fast enough. An article about Facebook’s autocomplete system [1] reveals that the system needs to return results within 100 milliseconds. Otherwise it will cause stuttering.
•Relevant: Autocomplete suggestions should be relevant to the search term.
•Sorted: Results returned by the system must be sorted by popularity or other ranking models.
•Scalable: The system can handle high traffic volume.
•Highly available: The system should remain available and accessible when part of the system is offline, slows down, or experiences unexpected network errors.
•Assume 10 million daily active users (DAU).
•An average person performs 10 searches per day.
•20 bytes of data per query string:
•Assume we use ASCII character encoding. 1 character = 1 byte
•Assume a query contains 4 words, and each word contains 5 characters on average.
•That is 4 x 5 = 20 bytes per query.
•For every character entered into the search box, a client sends a request to the backend for autocomplete suggestions. On average, 20 requests are sent for each search query. For example, the following 6 requests are sent to the backend by the time you finish typing “dinner”.
search?q=d
search?q=di
search?q=din
search?q=dinn
search?q=dinne
search?q=dinner
•~24,000 queries per second (QPS) = 10,000,000 users * 10 queries / day * 20 characters / 24 hours / 3600 seconds.
•Peak QPS = QPS * 2 = ~48,000
•Assume 20% of the daily queries are new. 10 million * 10 queries / day * 20 bytes per query * 20% = 0.4 GB. This means 0.4 GB of new data is added to storage daily.
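The arithmetic above can be double-checked in a few lines, using the chapter's assumptions:

```python
dau = 10_000_000
queries_per_user_per_day = 10
chars_per_query = 20          # ~20 autocomplete requests per typed query
bytes_per_query = 20
new_query_ratio = 0.20

requests_per_day = dau * queries_per_user_per_day * chars_per_query
qps = requests_per_day / (24 * 3600)   # ~23,148, quoted as ~24,000
peak_qps = qps * 2                     # ~46,300; ~48,000 from the rounded 24,000

new_data_gb = (dau * queries_per_user_per_day
               * bytes_per_query * new_query_ratio) / 1e9   # 0.4 GB/day
```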
At a high level, the system is broken down into two components:
•Data gathering service: It gathers user input queries and aggregates them in real-time. Real-time processing is not practical for large data sets; however, it is a good starting point. We will explore a more realistic solution in the deep dive.
•Query service: Given a search query or prefix, return the 5 most frequently searched terms.
Let us use a simplified example to see how the data gathering service works. Assume we have a frequency table that stores the query string and its frequency, as shown in Figure 13-2. In the beginning, the frequency table is empty. Later, users enter the queries “twitch”, “twitter”, “twitter”, and “twillo” sequentially. Figure 13-2 shows how the frequency table is updated.
Assume we have a frequency table as shown in Table 13-1. It has two fields.
•Query: it stores the query string.
•Frequency: it represents the number of times a query has been searched.
When a user types “tw” in the search box, the following top 5 searched queries are displayed (Figure 13-3), assuming the frequency table is based on Table 13-1.
To get the top 5 frequently searched queries, execute the following SQL query:
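The query itself appears as a figure in the book; a plausible reconstruction (table and column names follow Table 13-1's description, and the sample rows are invented for illustration), run here against sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frequency_table (query TEXT, frequency INTEGER)")
conn.executemany("INSERT INTO frequency_table VALUES (?, ?)", [
    ("twitter", 35), ("twitch", 29), ("twilight", 25),
    ("twin peak", 21), ("twitch prime", 18), ("twitter search", 14),
])

# Top 5 most frequently searched queries for the prefix "tw".
top5 = [q for (q,) in conn.execute("""
    SELECT query FROM frequency_table
    WHERE query LIKE 'tw%'
    ORDER BY frequency DESC
    LIMIT 5
""")]
```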
This is an acceptable solution when the data set is small. When it is large, accessing the database becomes a bottleneck. We will explore optimizations in the deep dive.
In the high-level design, we discussed the data gathering service and the query service. The high-level design is not optimal, but it serves as a good starting point. In this section, we will dive deep into a few components and explore optimizations as follows:
•Trie data structure
•Data gathering service
•Query service
•Scale the storage
•Trie operations
Relational databases are used for storage in the high-level design. However, fetching the top 5 search queries from a relational database is inefficient. The trie (prefix tree) data structure is used to overcome the problem. As the trie data structure is crucial for the system, we will dedicate significant time to designing a customized trie. Please note that some of the ideas are from articles [2] and [3].
Understanding the basic trie data structure is essential for this interview question. However, this is more of a data structure question than a system design question. Besides, many online materials explain this concept. In this chapter, we will only discuss an overview of the trie data structure and focus on how to optimize the basic trie to improve response time.
A trie (pronounced “try”) is a tree-like data structure that can compactly store strings. The name comes from the word retrieval, which indicates it is designed for string retrieval operations. The main ideas of a trie are the following:
•A trie is a tree-like data structure.
•The root represents an empty string.
•Each node stores a character and has 26 children, one for each possible character. To save space, we do not draw empty links.
•Each tree node represents a single word or a prefix string.
Figure 13-5 shows a trie with the search queries “tree”, “try”, “true”, “toy”, “wish”, “win”. Search queries are highlighted with a thicker border.
The basic trie data structure stores characters in nodes. To support sorting by frequency, frequency info needs to be included in nodes. Assume we have the following frequency table.
After adding frequency info to nodes, the updated trie data structure is shown in Figure 13-6.
How does autocomplete work with a trie? Before diving into the algorithm, let us define some terms.
•p: length of a prefix
•n: total number of nodes in a trie
•c: number of children of a given node
The steps to get the top k most searched queries are listed below:
1. Find the prefix. Time complexity: O(p).
2. Traverse the subtree from the prefix node to get all valid children. A child is valid if it can form a valid query string. Time complexity: O(c)
3. Sort the children and get the top k. Time complexity: O(c log c)
Let us use the example shown in Figure 13-7 to explain the algorithm. Assume k equals 2 and a user types “tr” in the search box. The algorithm works as follows:
•Step 1: Find the prefix node “tr”.
•Step 2: Traverse the subtree to get all valid children. In this case, nodes [tree: 10], [true: 35], [try: 29] are valid.
•Step 3: Sort the children and get the top 2. [true: 35] and [try: 29] are the top 2 queries with the prefix “tr”.
Figure 13-7
The time complexity of this algorithm is the sum of the time spent on each step mentioned above: O(p) + O(c) + O(c log c)
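The three steps above can be sketched with a plain trie. The "tr" subtree matches Figure 13-7 ([tree: 10], [true: 35], [try: 29]); the frequencies for "toy", "wish", and "win" are invented for illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.frequency = 0      # > 0 marks a complete query

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, query, frequency):
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.frequency = frequency

    def top_k(self, prefix, k):
        # Step 1: find the prefix node. O(p)
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Step 2: traverse the subtree for all valid children. O(c)
        results = []
        def dfs(n, word):
            if n.frequency > 0:
                results.append((word, n.frequency))
            for ch, child in n.children.items():
                dfs(child, word + ch)
        dfs(node, prefix)
        # Step 3: sort by frequency and take the top k. O(c log c)
        results.sort(key=lambda item: -item[1])
        return [word for word, _ in results[:k]]

trie = Trie()
for query, freq in [("tree", 10), ("true", 35), ("try", 29),
                    ("toy", 14), ("wish", 25), ("win", 50)]:
    trie.insert(query, freq)

top2 = trie.top_k("tr", 2)
```

With k = 2 and prefix "tr", the traversal visits the whole "tr" subtree before sorting, which is exactly the cost the next section's optimizations remove.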
The above algorithm is straightforward. However, it is too slow because we need to traverse the entire trie to get the top k results in the worst-case scenario. Below are two optimizations:
1. Limit the max length of a prefix
2. Cache top search queries at each node
Let us look at these optimizations one by one.
Users rarely type a long search query into the search box. Thus, it is safe to say p is a small integer, say 50. If we limit the length of a prefix, the time complexity for “Find the prefix” can be reduced from O(p) to O(small constant), aka O(1).
To avoid traversing the whole trie, we store the top k most frequently used queries at each node. Since 5 to 10 autocomplete suggestions are enough for users, k is a relatively small number. In our specific case, only the top 5 search queries are cached.
By caching the top search queries at every node, we significantly reduce the time complexity of retrieving the top 5 queries. However, this design requires a lot of space to store top queries at every node. Trading space for time is well worth it, as fast response time is very important.
Figure 13-8 shows the updated trie data structure. The top 5 queries are stored on each node. For example, the node with prefix “be” stores the following: [best: 35, bet: 29, bee: 20, be: 15, beer: 10].
Let us revisit the time complexity of the algorithm after applying those two optimizations:
1. Find the prefix node. Time complexity: O(1)
2. Return the top k. Since the top k queries are cached, the time complexity for this step is O(1).
As the time complexity for each of the steps is reduced to O(1), our algorithm takes only O(1) to fetch the top k queries.
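A sketch of the optimized trie, assuming the cached top-k list on each node is maintained at insert time; the sample data reproduces the "be" node example from Figure 13-8, plus an extra query ("beet", frequency 5, invented) that gets evicted from the cache.

```python
K = 5  # only the top 5 search queries are cached per node

class CachedTrieNode:
    def __init__(self):
        self.children = {}
        self.top_queries = []   # cached [(query, frequency)], capped at K

class CachedTrie:
    """Each node caches its own top-K queries, so a lookup is O(1):
    walk to the prefix node (bounded length) and return the cache."""
    def __init__(self):
        self.root = CachedTrieNode()

    def insert(self, query, frequency):
        # Update the cache on every node along the path, root included.
        node = self.root
        for ch in [None] + list(query):
            if ch is not None:
                node = node.children.setdefault(ch, CachedTrieNode())
            cache = [e for e in node.top_queries if e[0] != query]
            cache.append((query, frequency))
            cache.sort(key=lambda e: -e[1])
            node.top_queries = cache[:K]

    def top_k(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [q for q, _ in node.top_queries]

trie = CachedTrie()
for q, f in [("best", 35), ("bet", 29), ("bee", 20), ("be", 15),
             ("beer", 10), ("beet", 5)]:
    trie.insert(q, f)

be_top5 = trie.top_k("be")
```

The extra work moves to insert time, which is acceptable here because the trie is rebuilt offline on a schedule rather than updated on every user query.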
In our previous design, whenever a user types a search query, data is updated in real-time. This approach is not practical for the following two reasons:
•Users may enter billions of queries per day. Updating the trie on every query significantly slows down the query service.
•Top suggestions may not change much once the trie is built. Thus, it is unnecessary to update the trie frequently.
To design a scalable data gathering service, we examine where data comes from and how data is used. Real-time applications like Twitter require up-to-date autocomplete suggestions. However, autocomplete suggestions for many Google keywords might not change much on a daily basis.
Despite the differences in use cases, the underlying foundation of the data gathering service remains the same, because the data used to build the trie is usually from analytics or logging services.
Figure 13-9 shows the redesigned data gathering service. Each component is examined one by one.
分析日志。它存储有关搜索查询的原始数据。日志仅可追加且未建立索引。表 13-3 显示了日志文件的示例。
Analytics Logs. It stores raw data about search queries. Logs are append-only and are not indexed. Table 13-3 shows an example of the log file.
聚合器。分析日志的大小通常非常大,并且数据的格式不正确。我们需要聚合数据,以便我们的系统可以轻松处理这些数据。
Aggregators. The size of analytics logs is usually very large, and data is not in the right format. We need to aggregate data so it can be easily processed by our system.
根据用例,我们可能会以不同的方式聚合数据。对于 Twitter 等实时应用程序,我们会在较短的时间间隔内聚合数据,因为实时结果很重要。另一方面,对于许多用例来说,聚合数据的频率较低(例如每周一次)可能就足够了。在面试过程中,验证实时结果是否重要。我们假设 trie 每周重建一次。
Depending on the use case, we may aggregate data differently. For real-time applications such as Twitter, we aggregate data in a shorter time interval as real-time results are important. On the other hand, aggregating data less frequently, say once per week, might be good enough for many use cases. During an interview session, verify whether real-time results are important. We assume trie is rebuilt weekly.
表 13-4 显示了每周汇总数据的示例。“时间”字段代表一周的开始时间。“频率”字段是该周相应查询出现次数的总和。
Table 13-4 shows an example of aggregated weekly data. “time” field represents the start time of a week. “frequency” field is the sum of the occurrences for the corresponding query in that week.
Workers. Workers are a set of servers that perform asynchronous jobs at regular intervals. They build the trie data structure and store it in Trie DB.
Trie Cache. Trie Cache is a distributed cache system that keeps the trie in memory for fast reads. It takes a weekly snapshot of the DB.
Trie DB. Trie DB is the persistent storage. Two options are available to store the data:
1. Document store: Since a new trie is built weekly, we can periodically take a snapshot of it, serialize it, and store the serialized data in the database. Document stores like MongoDB [4] are a good fit for serialized data.
2. Key-value store: A trie can be represented in a hash table form [4] by applying the following logic:
•Every prefix in the trie is mapped to a key in a hash table.
•Data on each trie node is mapped to a value in a hash table.
Figure 13-10 shows the mapping between the trie and the hash table.
In Figure 13-10, each trie node on the left is mapped to a <key, value> pair on the right. If you are unclear how key-value stores work, refer to Chapter 6: Design a key-value store.
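The prefix-to-key mapping above can be sketched in a few lines. This is a simplified, single-machine illustration (the function name, the top-k limit of 5, and the in-memory dict standing in for the key-value store are all assumptions, not the book's implementation):

```python
def build_prefix_table(queries, k=5):
    """Map every prefix to its cached top-k (query, frequency) pairs.

    queries: dict of {query_string: frequency}, as in the aggregated data.
    """
    table = {}
    for query, freq in queries.items():
        # Every prefix of the query maps to a key in the hash table.
        for i in range(1, len(query) + 1):
            prefix = query[:i]
            table.setdefault(prefix, []).append((query, freq))
    # Each value keeps only the top-k queries, highest frequency first,
    # mirroring the top suggestions cached at each trie node.
    for prefix, pairs in table.items():
        pairs.sort(key=lambda p: -p[1])
        table[prefix] = pairs[:k]
    return table

table = build_prefix_table({"beer": 10, "best": 35, "be": 15})
print(table["be"])  # [('best', 35), ('be', 15), ('beer', 10)]
```

A real key-value store would persist each `(prefix, top-k list)` entry as one key-value pair instead of holding them in a Python dict.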
In the high-level design, the query service calls the database directly to fetch the top 5 results. Because that design is inefficient, Figure 13-11 shows an improved design:
1. A search query is sent to the load balancer.
2. The load balancer routes the request to API servers.
3. API servers get trie data from Trie Cache and construct autocomplete suggestions for the client.
4. In case the data is not in Trie Cache, we replenish it back to the cache. This way, all subsequent requests for the same prefix are returned from the cache. A cache miss can happen when a cache server is out of memory or offline.
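Steps 3 and 4 follow the classic cache-aside pattern. A minimal sketch, where plain dicts stand in for the distributed Trie Cache and Trie DB (the names and sample data are illustrative):

```python
trie_cache = {}                              # in-memory cache: prefix -> suggestions
trie_db = {"be": ["best", "bet", "beer"]}    # persistent store (illustrative data)

def get_suggestions(prefix):
    if prefix in trie_cache:                 # cache hit: serve from memory
        return trie_cache[prefix]
    suggestions = trie_db.get(prefix, [])    # cache miss: read from Trie DB
    trie_cache[prefix] = suggestions         # replenish the cache
    return suggestions

print(get_suggestions("be"))  # first call misses and reads the DB
print(get_suggestions("be"))  # second call is served from the cache
```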
The query service requires lightning-fast speed. We propose the following optimizations:
•AJAX request. For web applications, browsers usually send AJAX requests to fetch autocomplete results. The main benefit of AJAX is that sending/receiving a request/response does not refresh the whole web page.
•Browser caching. For many applications, autocomplete search suggestions may not change much within a short time. Thus, autocomplete suggestions can be saved in the browser cache to allow subsequent requests to get results from the cache directly. The Google search engine uses the same caching mechanism. Figure 13-12 shows the response header when you type “system design interview” into the Google search engine. As you can see, Google caches the results in the browser for 1 hour. Please note: “private” in cache-control means results are intended for a single user and must not be cached by a shared cache. “max-age=3600” means the cache is valid for 3600 seconds, that is, one hour.
•Data sampling: For a large-scale system, logging every search query requires a lot of processing power and storage. Data sampling is important. For instance, only 1 out of every N requests is logged by the system.
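The 1-out-of-N sampling idea can be sketched as a probabilistic filter in front of the logger. The sample rate, the list standing in for the append-only log, and the function name are all illustrative assumptions:

```python
import random

SAMPLE_RATE = 1 / 100      # keep roughly 1 out of every N=100 queries
analytics_log = []         # stands in for the append-only analytics log

def maybe_log(query):
    # Only a sampled fraction of search queries is persisted.
    if random.random() < SAMPLE_RATE:
        analytics_log.append(query)

for i in range(10_000):
    maybe_log(f"query-{i}")
print(len(analytics_log))  # on the order of 100, not 10,000
```

Because frequencies are aggregated later, sampled counts can be scaled back up by N if absolute numbers matter.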
The trie is a core component of the autocomplete system. Let us look at how trie operations (create, update, and delete) work.
The trie is created by workers using aggregated data sourced from the Analytics Logs/DB.
There are two ways to update the trie.
Option 1: Update the trie weekly. Once a new trie is created, it replaces the old one.
Option 2: Update individual trie nodes directly. We try to avoid this operation because it is slow. However, if the size of the trie is small, it is an acceptable solution. When we update a trie node, its ancestors all the way up to the root must be updated because the ancestors store the top queries of their children. Figure 13-13 shows an example of how the update operation works. On the left side, the search query “beer” has the original value 10. On the right side, it is updated to 30. As you can see, the node and its ancestors have the “beer” value updated to 30.
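Option 2 can be sketched with a toy single-machine trie. Because each ancestor caches the frequencies of queries below it, updating one query naturally walks the whole root-to-leaf path (class and field names here are illustrative, not the book's design):

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.top = {}        # query -> frequency cached at this node

root = TrieNode()

def insert_or_update(query, freq):
    node = root
    for ch in query:
        node = node.children.setdefault(ch, TrieNode())
        node.top[query] = freq   # every node on the path gets the new value

insert_or_update("beer", 10)
insert_or_update("beer", 30)     # the update touches root -> b -> e -> e -> r
print(root.children["b"].top)    # {'beer': 30}
```

This path walk is exactly why the operation is slow at scale: one update touches O(query length) nodes, and each node's cached top list may need re-sorting.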
We have to remove hateful, violent, sexually explicit, or dangerous autocomplete suggestions. We add a filter layer (Figure 13-14) in front of the Trie Cache to filter out unwanted suggestions. Having a filter layer gives us the flexibility of removing results based on different filter rules. Unwanted suggestions are removed physically from the database asynchronously so the correct data set is used to build the trie in the next update cycle.
Now that we have developed a system to bring autocomplete queries to users, it is time to solve the scalability issue when the trie grows too large to fit on one server.
Since English is the only supported language, a naive way to shard is based on the first character. Here are some examples.
•If we need two servers for storage, we can store queries starting with ‘a’ to ‘m’ on the first server, and ‘n’ to ‘z’ on the second server.
•If we need three servers, we can split queries into ‘a’ to ‘i’, ‘j’ to ‘r’, and ‘s’ to ‘z’.
Following this logic, we can split queries across up to 26 servers because there are 26 alphabetic characters in English. Let us define sharding based on the first character as first-level sharding. To store data beyond 26 servers, we can shard at the second or even third level. For example, data queries that start with ‘a’ can be split into 4 servers: ‘aa-ag’, ‘ah-an’, ‘ao-au’, and ‘av-az’.
At first glance this approach seems reasonable, until you realize that there are a lot more words that start with the letter ‘c’ than ‘x’. This creates uneven distribution.
To mitigate the data imbalance problem, we analyze historical data distribution patterns and apply smarter sharding logic as shown in Figure 13-15. The shard map manager maintains a lookup database for identifying where rows should be stored. For example, if there are a similar number of historical queries for ‘s’ and for ‘u’, ‘v’, ‘w’, ‘x’, ‘y’, and ‘z’ combined, we can maintain two shards: one for ‘s’ and one for ‘u’ to ‘z’.
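The shard map manager's lookup can be sketched as a small routing table: hot letters get dedicated shards, and the rest fall into ranges. The shard names and letter ranges below are made-up examples under the assumption that ‘s’ is hot:

```python
shard_map = {
    "s": "shard-1",                        # 's' is hot enough for its own shard
}
range_shards = [(("t", "z"), "shard-2"),   # 't'..'z' combined
                (("a", "r"), "shard-0")]   # the remaining cooler letters

def shard_for(query):
    first = query[0].lower()
    if first in shard_map:                 # exact-letter override wins
        return shard_map[first]
    for (lo, hi), shard in range_shards:   # otherwise fall back to ranges
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard for {query!r}")

print(shard_for("system design"))  # shard-1
print(shard_for("uber"))           # shard-2
```

In production this table would live in the lookup database and be rebuilt whenever the analyzed historical distribution shifts.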
After you finish the deep dive, your interviewer might ask you some follow-up questions.
Interviewer: How do you extend your design to support multiple languages?
To support other non-English queries, we store Unicode characters in trie nodes. If you are not familiar with Unicode, here is the definition: “an encoding standard covers all the characters for all the writing systems of the world, modern and ancient” [5].
Interviewer: What if top search queries in one country are different from others?
In this case, we might build different tries for different countries. To improve the response time, we can store the tries in CDNs.
Interviewer: How can we support trending (real-time) search queries?
Assuming a news event breaks out, a search query suddenly becomes popular. Our original design will not work because:
•Offline workers are not scheduled to update the trie yet because this is scheduled to run on a weekly basis.
•Even if it is scheduled, it takes too long to build the trie.
Building a real-time search autocomplete is complicated and is beyond the scope of this book, so we will only give a few ideas:
•Reduce the working data set by sharding.
•Change the ranking model and assign more weight to recent search queries.
•Data may come as streams, so we do not have access to all the data at once. Streaming data means data is generated continuously. Stream processing requires a different set of systems: Apache Hadoop MapReduce [6], Apache Spark Streaming [7], Apache Storm [8], Apache Kafka [9], etc. Because all of those topics require specific domain knowledge, we will not go into detail here.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] The Life of a Typeahead Query: https://www.facebook.com/notes/facebook-engineering/the-life-of-a-typeahead-query/389105248919/
[2] How We Built Prefixy: A Scalable Prefix Search Service for Powering Autocomplete: https://medium.com/@prefixyteam/how-we-built-prefixy-a-scalable-prefix-search-service-for-powering-autocomplete-c20f98e2eff1
[3] Prefix Hash Tree: An Indexing Data Structure over Distributed Hash Tables: https://people.eecs.berkeley.edu/~sylvia/papers/pht.pdf
[4] MongoDB Wikipedia: https://en.wikipedia.org/wiki/MongoDB
[5] Unicode frequently asked questions: https://www.unicode.org/faq/basic_q.html
[6] Apache Hadoop: https://hadoop.apache.org/
[7] Spark Streaming: https://spark.apache.org/streaming/
[8] Apache Storm: https://storm.apache.org/
[9] Apache Kafka: https://kafka.apache.org/documentation/
In this chapter, you are asked to design YouTube. The solution to this question can be applied to other interview questions, like designing a video sharing platform such as Netflix or Hulu. Figure 14-1 shows the YouTube homepage.
YouTube looks simple: content creators upload videos and viewers click play. Is it really that simple? Not really. There are lots of complex technologies underneath the simplicity. Let us look at some impressive statistics, demographics, and fun facts about YouTube in 2020 [1] [2].
•Total number of monthly active users: 2 billion.
•Number of videos watched per day: 5 billion.
•73% of US adults use YouTube.
•50 million creators on YouTube.
•YouTube’s ad revenue was $15.1 billion for the full year 2019, up 36% from 2018.
•YouTube is responsible for 37% of all mobile internet traffic.
•YouTube is available in 80 different languages.
From these statistics, we know YouTube is enormous, global, and makes a lot of money.
As revealed in Figure 14-1, besides watching a video, you can do a lot more on YouTube. For example, comment on, share, or like a video, save a video to playlists, subscribe to a channel, etc. It is impossible to design everything within a 45- or 60-minute interview. Thus, it is important to ask questions to narrow down the scope.
Candidate: What features are important?
Interviewer: Ability to upload a video and watch a video.
Candidate: What clients do we need to support?
Interviewer: Mobile apps, web browsers, and smart TV.
Candidate: How many daily active users do we have?
Interviewer: 5 million.
Candidate: What is the average daily time spent on the product?
Interviewer: 30 minutes.
Candidate: Do we need to support international users?
Interviewer: Yes, a large percentage of users are international users.
Candidate: What are the supported video resolutions?
Interviewer: The system accepts most video resolutions and formats.
Candidate: Is encryption required?
Interviewer: Yes.
Candidate: Any file size requirement for videos?
Interviewer: Our platform focuses on small and medium-sized videos. The maximum allowed video size is 1GB.
Candidate: Can we leverage some of the existing cloud infrastructure provided by Amazon, Google, or Microsoft?
Interviewer: That is a great question. Building everything from scratch is unrealistic for most companies; it is recommended to leverage some of the existing cloud services.
In this chapter, we focus on designing a video streaming service with the following features:
•Ability to upload videos fast
•Smooth video streaming
•Ability to change video quality
•Low infrastructure cost
•High availability, scalability, and reliability requirements
•Clients supported: mobile apps, web browsers, and smart TV
The following estimations are based on many assumptions, so it is important to communicate with the interviewer to make sure she is on the same page.
•Assume the product has 5 million daily active users (DAU).
•Users watch 5 videos per day.
•10% of users upload 1 video per day.
•Assume the average video size is 300 MB.
•Total daily storage space needed: 5 million * 10% * 300 MB = 150TB
•CDN cost.
•When a cloud CDN serves a video, you are charged for data transferred out of the CDN.
•Let us use Amazon’s CDN CloudFront for cost estimation (Figure 14-2) [3]. Assume 100% of the traffic is served from the United States. The average cost per GB is $0.02. For simplicity, we only calculate the cost of video streaming.
•5 million * 5 videos * 0.3GB * $0.02 = $150,000 per day.
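The back-of-the-envelope numbers above can be spelled out as a short calculation. All inputs are the assumptions stated in the text:

```python
dau = 5_000_000            # daily active users
videos_watched = 5         # videos watched per user per day
upload_ratio = 0.10        # 10% of users upload 1 video per day
avg_video_gb = 0.3         # average video size: 300 MB
cdn_cost_per_gb = 0.02     # USD per GB, CloudFront US estimate

# Storage: only uploads consume new storage.
daily_storage_tb = dau * upload_ratio * avg_video_gb / 1000

# CDN: every watched video is transferred out of the CDN.
daily_cdn_cost = dau * videos_watched * avg_video_gb * cdn_cost_per_gb

print(f"storage: {daily_storage_tb:.0f} TB/day")  # 150 TB/day
print(f"CDN: ${daily_cdn_cost:,.0f}/day")         # $150,000/day
```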
From this rough cost estimation, we know serving videos from the CDN costs lots of money. Even though cloud providers are willing to lower CDN costs significantly for big customers, the cost is still substantial. We will discuss ways to reduce CDN costs in the deep dive.
As discussed previously, the interviewer recommended leveraging existing cloud services instead of building everything from scratch. CDN and blob storage are the cloud services we will leverage. Some readers might ask why not build everything ourselves? The reasons are listed below:
•System design interviews are not about building everything from scratch. Within the limited time frame, choosing the right technology to do a job right is more important than explaining how the technology works in detail. For instance, mentioning blob storage for storing source videos is enough for the interview. Talking about the detailed design of blob storage could be overkill.
•Building scalable blob storage or a CDN is extremely complex and costly. Even large companies like Netflix or Facebook do not build everything themselves. Netflix leverages Amazon’s cloud services [4], and Facebook uses Akamai’s CDN [5].
At a high level, the system comprises three components (Figure 14-3).
Client: You can watch YouTube on your computer, mobile phone, and smart TV.
CDN: Videos are stored in the CDN. When you press play, a video is streamed from the CDN.
API servers: Everything else except video streaming goes through API servers. This includes feed recommendation, generating video upload URLs, updating the metadata database and cache, user signup, etc.
In the question/answer session, the interviewer showed interest in two flows:
•Video uploading flow
•Video streaming flow
We will explore the high-level design for each of them.
Figure 14-4 shows the high-level design for video uploading.
It consists of the following components:
•User: A user watches YouTube on devices such as a computer, mobile phone, or smart TV.
•Load balancer: A load balancer evenly distributes requests among API servers.
•API servers: All user requests go through API servers except video streaming.
•Metadata DB: Video metadata is stored in Metadata DB. It is sharded and replicated to meet performance and high availability requirements.
•Metadata cache: For better performance, video metadata and user objects are cached.
•Original storage: A blob storage system is used to store original videos. A quotation from Wikipedia regarding blob storage says: “A Binary Large Object (BLOB) is a collection of binary data stored as a single entity in a database management system” [6].
•Transcoding servers: Video transcoding is also called video encoding. It is the process of converting a video format to other formats (MPEG, HLS, etc.), which provide the best possible video streams for different devices and bandwidth capabilities.
•Transcoded storage: It is a blob storage that stores transcoded video files.
•CDN: Videos are cached in the CDN. When you click the play button, a video is streamed from the CDN.
•Completion queue: It is a message queue that stores information about video transcoding completion events.
•Completion handler: This consists of a list of workers that pull event data from the completion queue and update the metadata cache and database.
Now that we understand each component individually, let us examine how the video uploading flow works. The flow is broken down into two processes running in parallel:
a. Upload the actual video.
b. Update the video metadata. Metadata contains information about the video URL, size, resolution, format, user info, etc.
Figure 14-5 shows how to upload the actual video. The explanation is shown below:
1. Videos are uploaded to the original storage.
2. Transcoding servers fetch videos from the original storage and start transcoding.
3. Once transcoding is complete, the following two steps are executed in parallel:
3a. Transcoded videos are sent to transcoded storage.
3b. Transcoding completion events are queued in the completion queue.
3a.1. Transcoded videos are distributed to the CDN.
3b.1. The completion handler contains a bunch of workers that continuously pull event data from the queue.
3b.1.a. and 3b.1.b. The completion handler updates the metadata database and cache when video transcoding is complete.
4. API servers inform the client that the video is successfully uploaded and is ready for streaming.
While a file is being uploaded to the original storage, the client in parallel sends a request to update the video metadata as shown in Figure 14-6. The request contains video metadata, including file name, size, format, etc. API servers update the metadata cache and database.
Whenever you watch a video on YouTube, it usually starts streaming immediately and you do not wait until the whole video is downloaded. Downloading means the whole video is copied to your device, while streaming means your device continuously receives video streams from remote source videos. When you watch streaming videos, your client loads a little bit of data at a time so you can watch videos immediately and continuously.
Before we discuss the video streaming flow, let us look at an important concept: streaming protocols. A streaming protocol is a standardized way to control data transfer for video streaming. Popular streaming protocols are:
•MPEG-DASH. MPEG stands for “Moving Picture Experts Group” and DASH stands for “Dynamic Adaptive Streaming over HTTP”.
•Apple HLS. HLS stands for “HTTP Live Streaming”.
•Microsoft Smooth Streaming.
•Adobe HTTP Dynamic Streaming (HDS).
You do not need to fully understand or even remember those streaming protocol names as they are low-level details that require specific domain knowledge. The important thing here is to understand that different streaming protocols support different video encodings and playback players. When we design a video streaming service, we have to choose the right streaming protocol to support our use cases. To learn more about streaming protocols, here is an excellent article [7].
Videos are streamed from the CDN directly. The edge server closest to you will deliver the video. Thus, there is very little latency. Figure 14-7 shows a high-level design for video streaming.
In the high-level design, the entire system is broken down into two parts: the video uploading flow and the video streaming flow. In this section, we will refine both flows with important optimizations and introduce error handling mechanisms.
When you record a video, the device (usually a phone or camera) gives the video file a certain format. If you want the video to be played smoothly on other devices, the video must be encoded into compatible bitrates and formats. Bitrate is the rate at which bits are processed over time. A higher bitrate generally means higher video quality. High bitrate streams need more processing power and faster internet speeds.
Video transcoding is important for the following reasons:
•Raw video consumes large amounts of storage space. An hour-long high definition video recorded at 60 frames per second can take up a few hundred GB of space.
•Many devices and browsers only support certain types of video formats. Thus, it is important to encode a video to different formats for compatibility reasons.
•To ensure users watch high-quality videos while maintaining smooth playback, it is a good idea to deliver higher resolution video to users who have high network bandwidth and lower resolution video to users who have low bandwidth.
•Network conditions can change, especially on mobile devices. To ensure a video is played continuously, switching video quality automatically or manually based on network conditions is essential for a smooth user experience.
Many types of encoding formats are available; however, most of them contain two parts:
•Container: This is like a basket that contains the video file, audio, and metadata. You can tell the container format by the file extension, such as .avi, .mov, or .mp4.
•Codecs: These are compression and decompression algorithms that aim to reduce the video size while preserving the video quality. The most used video codecs are H.264, VP9, and HEVC.
Transcoding a video is computationally expensive and time-consuming. Besides, different content creators may have different video processing requirements. For instance, some content creators require watermarks on top of their videos, some provide thumbnail images themselves, and some upload high definition videos, whereas others do not.
To support different video processing pipelines and maintain high parallelism, it is important to add some level of abstraction and let client programmers define what tasks to execute. For example, Facebook’s streaming video engine uses a directed acyclic graph (DAG) programming model, which defines tasks in stages so they can be executed sequentially or in parallel [8]. In our design, we adopt a similar DAG model to achieve flexibility and parallelism. Figure 14-8 represents a DAG for video transcoding.
In Figure 14-8, the original video is split into video, audio, and metadata. Here are some of the tasks that can be applied to a video file:
•Inspection: Make sure videos have good quality and are not malformed.
•Video encodings: Videos are converted to support different resolutions, codecs, bitrates, etc. Figure 14-9 shows an example of video encoded files.
•Thumbnail. Thumbnails can either be uploaded by a user or automatically generated by the system.
•Watermark: An image overlay on top of your video that contains identifying information about your video.
The proposed video transcoding architecture that leverages cloud services is shown in Figure 14-10.
The architecture has six main components: preprocessor, DAG scheduler, resource manager, task workers, temporary storage, and encoded video as the output. Let us take a close look at each component.
The preprocessor has 4 responsibilities:
1. Video splitting. The video stream is split into smaller chunks by Group of Pictures (GOP) alignment. A GOP is a group/chunk of frames arranged in a specific order. Each chunk is an independently playable unit, usually a few seconds in length.
2. Some old mobile devices or browsers might not support video splitting. The preprocessor splits videos by GOP alignment for old clients.
3. DAG generation. The preprocessor generates the DAG based on configuration files client programmers write. Figure 14-12 is a simplified DAG representation which has 2 nodes and 1 edge:
This DAG representation is generated from the two configuration files below (Figure 14-13):
4. Cache data. The preprocessor is a cache for segmented videos. For better reliability, the preprocessor stores GOPs and metadata in temporary storage. If video encoding fails, the system can use the persisted data for retry operations.
The DAG scheduler splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager. Figure 14-15 shows an example of how the DAG scheduler works.
As shown in Figure 14-15, the original video is split in Stage 1 into video, audio, and metadata. In Stage 2, the video file is further split into two tasks: video encoding and thumbnail generation. The audio file requires audio encoding as part of the Stage 2 tasks.
The resource manager is responsible for managing the efficiency of resource allocation. It contains 3 queues and a task scheduler, as shown in Figure 14-17.
•Task queue: a priority queue that contains tasks to be executed.
•Worker queue: a priority queue that contains worker utilization info.
•Running queue: contains info about the currently running tasks and the workers running them.
•Task scheduler: picks the optimal task/worker pair and instructs the chosen task worker to execute the job.
The resource manager works as follows:
•The task scheduler gets the highest-priority task from the task queue.
•The task scheduler gets the optimal task worker from the worker queue to run the task.
•The task scheduler instructs the chosen task worker to run the task.
•The task scheduler binds the task/worker info and puts it in the running queue.
•The task scheduler removes the job from the running queue once the job is done.
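The scheduling loop above maps naturally onto two min-heaps and a running map. Below is a minimal sketch under those assumptions; the class and method names (`ResourceManager`, `submit_task`, etc.) are hypothetical and not from the book.

```python
import heapq

class ResourceManager:
    """Minimal sketch of the resource manager's three queues."""
    def __init__(self):
        self.task_queue = []     # (priority, task) min-heap; lower number = higher priority
        self.worker_queue = []   # (utilization, worker) min-heap; least-busy worker first
        self.running = {}        # task -> worker: the "running queue"

    def submit_task(self, priority, task):
        heapq.heappush(self.task_queue, (priority, task))

    def add_worker(self, utilization, worker):
        heapq.heappush(self.worker_queue, (utilization, worker))

    def schedule(self):
        # Pick the highest-priority task and the least-utilized worker,
        # bind them, and record the pair in the running queue.
        _priority, task = heapq.heappop(self.task_queue)
        _utilization, worker = heapq.heappop(self.worker_queue)
        self.running[task] = worker
        return task, worker

    def complete(self, task):
        # Remove the job from the running queue once it is done.
        del self.running[task]

rm = ResourceManager()
rm.submit_task(2, "thumbnail")
rm.submit_task(1, "video encoding")
rm.add_worker(0.3, "worker-a")
rm.add_worker(0.7, "worker-b")
task, worker = rm.schedule()
print(task, worker)
```

Here the "video encoding" task (priority 1) is bound to the least-utilized worker and placed in the running map until `complete` is called.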
Task workers run the tasks defined in the DAG. Different task workers may run different tasks, as shown in Figure 14-19.
Multiple storage systems are used here. The choice of storage system depends on factors like data type, data size, access frequency, data life span, etc. For instance, metadata is frequently accessed by workers, and the data size is usually small. Thus, caching metadata in memory is a good idea. For video or audio data, we put them in blob storage. Data in temporary storage is freed up once the corresponding video processing is complete.
Encoded video is the final output of the encoding pipeline. Here is an example of the output: funny_720p.mp4.
At this point, you should have a good understanding of the video uploading flow, video streaming flow, and video transcoding. Next, we will refine the system with optimizations for speed, safety, and cost-saving.
Uploading a video as a whole unit is inefficient. We can split a video into smaller chunks by GOP alignment, as shown in Figure 14-22.
This allows fast, resumable uploads when a previous upload fails. The job of splitting a video file by GOP can be implemented by the client to improve the upload speed, as shown in Figure 14-23.
Another way to improve the upload speed is by setting up multiple upload centers across the globe (Figure 14-24). People in the United States can upload videos to the North America upload center, and people in China can upload videos to the Asia upload center. To achieve this, we use CDN as upload centers.
Achieving low latency requires serious effort. Another optimization is to build a loosely coupled system and enable high parallelism.
Our design needs some modifications to achieve high parallelism. Let us zoom in on the flow of how a video is transferred from original storage to the CDN. The flow, shown in Figure 14-25, reveals that each output depends on the input of the previous step. This dependency makes parallelism difficult.
To make the system more loosely coupled, we introduce message queues, as shown in Figure 14-26. Let us use an example to explain how message queues make the system more loosely coupled.
•Before the message queue is introduced, the encoding module must wait for the output of the download module.
•After the message queue is introduced, the encoding module no longer needs to wait for the output of the download module. If there are events in the message queue, the encoding module can execute those jobs in parallel.
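The decoupling above can be sketched with an in-process queue: the download module publishes events and the encoding module consumes them independently. This is an illustrative stand-in (using Python's `queue.Queue`) for a real message broker; the module names are hypothetical.

```python
import queue
import threading

# The queue decouples producer and consumer: neither blocks on the other's pace.
download_events = queue.Queue()

def download_module(video_ids):
    for vid in video_ids:
        download_events.put(vid)   # publish a "video downloaded" event
    download_events.put(None)      # sentinel: no more work

encoded = []

def encoding_module():
    while True:
        vid = download_events.get()
        if vid is None:
            break
        encoded.append(f"{vid}.encoded")  # pretend to encode the chunk

producer = threading.Thread(target=download_module, args=(["v1", "v2", "v3"],))
consumer = threading.Thread(target=encoding_module)
producer.start(); consumer.start()
producer.join(); consumer.join()
print(encoded)
```

With a real broker, multiple encoding workers could consume from the same queue, which is where the parallelism comes from.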
Safety is one of the most important aspects of any product. To ensure only authorized users upload videos to the right location, we introduce pre-signed URLs, as shown in Figure 14-27.
The upload flow is updated as follows:
1. The client makes an HTTP request to the API servers to fetch the pre-signed URL, which grants access permission to the object identified in the URL. The term pre-signed URL is used for uploading files to Amazon S3. Other cloud service providers might use a different name. For instance, Microsoft Azure Blob Storage supports the same feature but calls it “Shared Access Signature” [10].
2. API servers respond with a pre-signed URL.
3. Once the client receives the response, it uploads the video using the pre-signed URL.
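To make the idea concrete, here is a hedged sketch of how an API server could mint and verify a signed upload URL using an HMAC over the object key and expiry. This is not S3's actual signing scheme (SigV4 is more involved); the host, secret, and parameter names are all hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import urlencode

SECRET_KEY = b"server-side-secret"   # hypothetical; never shipped to clients

def presign(object_key: str, expires_in: int = 3600, now: int = None) -> str:
    """API-server side: mint a URL the client can use to upload one object."""
    now = int(time.time()) if now is None else now
    expires = now + expires_in
    payload = f"{object_key}:{expires}".encode()
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    query = urlencode({"key": object_key, "expires": expires, "signature": signature})
    return f"https://storage.example.com/upload?{query}"

def verify(object_key: str, expires: int, signature: str, now: int = None) -> bool:
    """Storage side: accept the upload only if the signature checks out and
    the URL has not expired."""
    now = int(time.time()) if now is None else now
    if now > expires:
        return False
    payload = f"{object_key}:{expires}".encode()
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

url = presign("videos/funny_720p.mp4", expires_in=3600, now=1_700_000_000)
print(url)
```

Because only the API server and the storage service know the secret, a client cannot forge access to objects it was not granted.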
Many content makers are reluctant to post videos online because they fear their original videos will be stolen. To protect copyrighted videos, we can adopt one of the following three safety options:
•Digital rights management (DRM) systems: the three major DRM systems are Apple FairPlay, Google Widevine, and Microsoft PlayReady.
•AES encryption: you can encrypt a video and configure an authorization policy. The encrypted video is decrypted upon playback. This ensures that only authorized users can watch an encrypted video.
•Visual watermarking: an image overlay on top of your video that contains identifying information, such as your company logo or company name.
CDN is a crucial component of our system. It ensures fast video delivery on a global scale. However, from the back-of-the-envelope calculation, we know CDN is expensive, especially when the data size is large. How can we reduce the cost?
Previous research shows that YouTube video streams follow a long-tail distribution [11] [12]. This means a few popular videos are accessed frequently while many others have few or no viewers. Based on this observation, we implement a few optimizations:
1. Only serve the most popular videos from the CDN and other videos from our high-capacity storage video servers (Figure 14-28).
2. For less popular content, we may not need to store many encoded video versions. Short videos can be encoded on-demand.
3. Some videos are popular only in certain regions. There is no need to distribute these videos to other regions.
4. Build your own CDN like Netflix and partner with Internet Service Providers (ISPs). Building your own CDN is a giant project; however, this could make sense for large streaming companies. An ISP can be Comcast, AT&T, Verizon, or another internet provider. ISPs are located all around the world and are close to users. By partnering with ISPs, you can improve the viewing experience and reduce bandwidth charges.
All these optimizations are based on content popularity, user access pattern, video size, etc. It is important to analyze historical viewing patterns before doing any optimization. Here are some interesting articles on this topic: [12] [13].
For a large-scale system, system errors are unavoidable. To build a highly fault-tolerant system, we must handle errors gracefully and recover from them fast. Two types of errors exist:
•Recoverable errors. For recoverable errors, such as a video segment failing to transcode, the general idea is to retry the operation a few times. If the task continues to fail and the system believes it is not recoverable, it returns a proper error code to the client.
•Non-recoverable errors. For non-recoverable errors, such as a malformed video format, the system stops the running tasks associated with the video and returns the proper error code to the client.
Typical errors for each system component are covered by the following playbook:
•Upload error: retry a few times.
•Split video error: if older versions of clients cannot split videos by GOP alignment, the entire video is passed to the server. The job of splitting videos is done on the server side.
•Transcoding error: retry.
•Preprocessor error: regenerate the DAG diagram.
•DAG scheduler error: reschedule the task.
•Resource manager queue down: use a replica.
•Task worker down: retry the task on a new worker.
•API server down: API servers are stateless, so requests will be directed to a different API server.
•Metadata cache server down: data is replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We can bring up a new cache server to replace the dead one.
•Metadata DB server down:
•Master is down: promote one of the slaves to act as the new master.
•Slave is down: use another slave for reads and bring up another database server to replace the dead one.
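Several entries in the playbook above boil down to "retry a few times, then surface a proper error." A common way to implement that is exponential backoff with jitter; the sketch below is illustrative (the function names and the injectable `sleep` are assumptions for testability, not the book's code).

```python
import random

def retry(operation, max_attempts=3, base_delay=1.0, sleep=None):
    """Retry a recoverable operation with exponential backoff.
    `sleep` is injectable so the sketch can be exercised without waiting."""
    sleep = sleep or (lambda seconds: None)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up: let the caller return a proper error code
            # Back off 1s, 2s, 4s, ... plus a little jitter to avoid thundering herds.
            sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

calls = {"n": 0}

def flaky_transcode():
    """Simulated recoverable error: fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("segment failed to transcode")
    return "ok"

result = retry(flaky_transcode)
print(result, calls["n"])
```

If the operation keeps failing past `max_attempts`, the exception propagates, matching the "system believes it is not recoverable" branch described above.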
In this chapter, we presented the architecture design of video streaming services like YouTube. If there is extra time at the end of the interview, here are a few additional points:
•Scale the API tier: because API servers are stateless, it is easy to scale the API tier horizontally.
•Scale the database: you can talk about database replication and sharding.
•Live streaming: the process of recording and broadcasting a video in real time. Although our system is not designed specifically for live streaming, live streaming and non-live streaming have some similarities: both require uploading, encoding, and streaming. The notable differences are:
•Live streaming has a higher latency requirement, so it might need a different streaming protocol.
•Live streaming has a lower requirement for parallelism because small chunks of data are already processed in real time.
•Live streaming requires different sets of error handling. Any error handling that takes too much time is not acceptable.
•Video takedowns: videos that violate copyrights, pornography, or other illegal acts shall be removed. Some can be discovered by the system during the upload process, while others might be discovered through user flagging.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] YouTube by the numbers: https://www.omnicoreagency.com/youtube-statistics/
[2] 2019 YouTube Demographics: https://blog.hubspot.com/marketing/youtube-demographics
[3] Cloudfront Pricing: https://aws.amazon.com/cloudfront/pricing/
[4] Netflix on AWS: https://aws.amazon.com/solutions/case-studies/netflix/
[5] Akamai homepage: https://www.akamai.com/
[6] Binary large object: https://en.wikipedia.org/wiki/Binary_large_object
[7] Here’s What You Need to Know About Streaming Protocols: https://www.dacast.com/blog/streaming-protocols/
[8] SVE: Distributed Video Processing at Facebook Scale: https://www.cs.princeton.edu/~wlloyd/papers/sve-sosp17.pdf
[9] Weibo video processing architecture (in Chinese): https://www.upyun.com/opentalk/399.html
[10] Delegate access with a shared access signature:
[11] YouTube scalability talk by early YouTube employee: https://www.youtube.com/watch?v=w5WVu624fY8
[12] Understanding the characteristics of internet short video sharing: A YouTube-based measurement study: https://arxiv.org/pdf/0707.3670.pdf
[13] Content Popularity for Open Connect: https://netflixtechblog.com/content-popularity-for-open-connect-b86d56f613b
In recent years, cloud storage services such as Google Drive, Dropbox, Microsoft OneDrive, and Apple iCloud have become very popular. In this chapter, you are asked to design Google Drive.
Let us take a moment to understand Google Drive before jumping into the design. Google Drive is a file storage and synchronization service that helps you store documents, photos, videos, and other files in the cloud. You can access your files from any computer, smartphone, or tablet, and easily share those files with friends, family, and coworkers [1]. Figures 15-1 and 15-2 show what Google Drive looks like on a browser and a mobile application, respectively.
Designing Google Drive is a big project, so it is important to ask questions to narrow down the scope.
Candidate: What are the most important features?
Interviewer: Upload and download files, file sync, and notifications.
Candidate: Is this a mobile app, a web app, or both?
Interviewer: Both.
Candidate: What are the supported file formats?
Interviewer: Any file type.
Candidate: Do files need to be encrypted?
Interviewer: Yes, files in the storage must be encrypted.
Candidate: Is there a file size limit?
Interviewer: Yes, files must be 10 GB or smaller.
Candidate: How many users does the product have?
Interviewer: 10M DAU.
In this chapter, we focus on the following features:
•Add files. The easiest way to add a file is to drag and drop it into Google Drive.
•Download files.
•Sync files across multiple devices. When a file is added to one device, it is automatically synced to other devices.
•See file revisions.
•Share files with your friends, family, and coworkers.
•Send a notification when a file is edited, deleted, or shared with you.
Features not discussed in this chapter include:
•Google Docs editing and collaboration. Google Docs allows multiple people to edit the same document simultaneously. This is out of our design scope.
Other than clarifying requirements, it is important to understand the non-functional requirements:
•Reliability. Reliability is extremely important for a storage system. Data loss is unacceptable.
•Fast sync speed. If file sync takes too much time, users will become impatient and abandon the product.
•Bandwidth usage. If the product uses a lot of unnecessary network bandwidth, users will be unhappy, especially when they are on a mobile data plan.
•Scalability. The system should be able to handle high volumes of traffic.
•High availability. Users should still be able to use the system when some servers are offline, slowed down, or have unexpected network errors.
•Assume the application has 50 million signed-up users and 10 million DAU.
•Users get 10 GB of free space.
•Assume users upload 2 files per day. The average file size is 500 KB.
•1:1 read-to-write ratio.
•Total space allocated: 50 million * 10 GB = 500 PB
•QPS for the upload API: 10 million * 2 uploads / 24 hours / 3600 seconds = ~240
•Peak QPS = QPS * 2 = 480
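The estimates above are easy to sanity-check with a few lines of arithmetic. Note the exact quotient for upload QPS is about 231, which the chapter rounds to roughly 240.

```python
# Back-of-the-envelope numbers from the requirements above.
signed_up_users = 50_000_000
free_space_gb = 10
dau = 10_000_000
uploads_per_user_per_day = 2

total_space_pb = signed_up_users * free_space_gb / 1_000_000   # GB -> PB
upload_qps = dau * uploads_per_user_per_day / (24 * 3600)      # uploads per second
peak_qps = upload_qps * 2

print(f"Total space: {total_space_pb:.0f} PB")
print(f"Upload QPS: ~{upload_qps:.0f} (rounded to ~240 in the text)")
print(f"Peak QPS: ~{peak_qps:.0f}")
```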
Instead of showing the high-level design diagram from the beginning, we will use a slightly different approach: start with something simple and build everything on a single server, then gradually scale it up to support millions of users. This exercise will refresh your memory on some important topics covered in this book.
Let us start with a single server setup as listed below:
•A web server to upload and download files.
•A database to keep track of metadata like user data, login info, file info, etc.
•A storage system to store files. We allocate 1 TB of storage space to store files.
We spend a few hours setting up an Apache web server, a MySQL database, and a directory called drive/ as the root directory to store uploaded files. Under the drive/ directory, there is a list of directories, known as namespaces. Each namespace contains all the uploaded files for that user. The filename on the server is kept the same as the original file name. Each file or folder can be uniquely identified by joining the namespace and the relative path.
Figure 15-3 shows an example of what the /drive directory looks like on the left side and its expanded view on the right side.
What do the APIs look like? We primarily need 3 APIs: upload a file, download a file, and get file revisions.
1. Upload a file to Google Drive
Two types of uploads are supported:
•Simple upload. Use this upload type when the file size is small.
•Resumable upload. Use this upload type when the file size is large and there is a high chance of network interruption.
Here is an example of the resumable upload API:
https://api.example.com/files/upload?uploadType=resumable
Params:
•uploadType=resumable
•data: local file to be uploaded.
A resumable upload is achieved by the following 3 steps [2]:
•Send the initial request to retrieve the resumable URL.
•Upload the data and monitor the upload state.
•If the upload is disturbed, resume the upload.
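The three steps above can be simulated end to end in a few lines. The sketch below stands in for the real client/server exchange: the `sessions` dict plays the upload server, and `fail_after` simulates a network interruption; all names and the URL format are hypothetical.

```python
# In-memory stand-in for the upload server: maps session URL -> bytes received.
sessions = {}

def start_session(filename):
    """Step 1: the initial request returns a resumable URL."""
    url = f"https://api.example.com/upload/{filename}"
    sessions.setdefault(url, bytearray())
    return url

def upload(url, data, chunk_size=4, fail_after=None):
    """Steps 2-3: upload chunks starting from wherever the server left off.
    `fail_after` simulates a network interruption after N chunks."""
    received = sessions[url]
    offset = len(received)   # the server reports how much it already has
    sent = 0
    while offset < len(data):
        if fail_after is not None and sent == fail_after:
            raise ConnectionError("network interrupted")
        received.extend(data[offset:offset + chunk_size])
        offset += chunk_size
        sent += 1
    return bytes(received)

url = start_session("best_soup.txt")
try:
    upload(url, b"0123456789", fail_after=1)   # dies after one 4-byte chunk
except ConnectionError:
    pass
result = upload(url, b"0123456789")            # resume: only remaining bytes go out
print(result)
```

The key point is that the second call starts from the server-reported offset, so the bytes uploaded before the interruption are never re-sent.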
2. Download a file from Google Drive
Example API: https://api.example.com/files/download
Params:
•path: download file path.
Example params:
{
"path": "/recipes/soup/best_soup.txt"
}
3. Get file revisions
Example API: https://api.example.com/files/list_revisions
Params:
•path: the path to the file whose revision history you want to get.
•limit: the maximum number of revisions to return.
Example params:
{
"path": "/recipes/soup/best_soup.txt",
"limit": 20
}
All the APIs require user authentication and use HTTPS. Secure Sockets Layer (SSL) protects data transfer between the client and backend servers.
As more files are uploaded, eventually you get a space full alert, as shown in Figure 15-4.
Only 10 MB of storage space is left! This is an emergency, as users cannot upload files anymore. The first solution that comes to mind is to shard the data so it is stored on multiple storage servers. Figure 15-5 shows an example of sharding based on user_id.
You pull an all-nighter to set up database sharding and monitor it closely. Everything works smoothly again. You have put out the fire, but you are still worried about potential data loss in case of a storage server outage. You ask around, and your backend guru friend Frank tells you that many leading companies like Netflix and Airbnb use Amazon S3 for storage. “Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance” [3]. You decide to do some research to see if it is a good fit.
After a lot of reading, you gain a good understanding of the S3 storage system and decide to store files in S3. Amazon S3 supports same-region and cross-region replication. A region is a geographic area where Amazon Web Services (AWS) has data centers. As shown in Figure 15-6, data can be replicated in the same region (left side) or across regions (right side). Redundant files are stored in multiple regions to guard against data loss and ensure availability. A bucket is like a folder in file systems.
After putting files in S3, you can finally have a good night's sleep without worrying about data loss. To stop similar problems from happening in the future, you decide to do further research on areas you can improve. Here are a few areas you find:
•Load balancer: add a load balancer to distribute network traffic. A load balancer ensures evenly distributed traffic, and if a web server goes down, it redistributes the traffic.
•Web servers: after a load balancer is added, more web servers can be added or removed easily, depending on the traffic load.
•Metadata database: move the database out of the server to avoid a single point of failure. In the meantime, set up data replication and sharding to meet the availability and scalability requirements.
•File storage: Amazon S3 is used for file storage. To ensure availability and durability, files are replicated in two separate geographical regions.
After applying the above improvements, you have successfully decoupled the web servers, metadata database, and file storage from a single server. The updated design is shown in Figure 15-7.
For a large storage system like Google Drive, sync conflicts happen from time to time. When two users modify the same file or folder at the same time, a conflict happens. How can we resolve it? Here is our strategy: the first version that gets processed wins, and the version that gets processed later receives a conflict. Figure 15-8 shows an example of a sync conflict.
In Figure 15-8, user 1 and user 2 try to update the same file at the same time, but user 1's file is processed by our system first. User 1's update operation goes through, but user 2 gets a sync conflict. How can we resolve the conflict for user 2? Our system presents both copies of the same file: user 2's local copy and the latest version from the server (Figure 15-9). User 2 has the option to merge both files or override one version with the other.
While multiple users are editing the same document at the same time, it is challenging to keep the document synchronized. Interested readers should refer to the reference materials [4] [5].
Figure 15-10 illustrates the proposed high-level design. Let us examine each component of the system.
User: a user uses the application through a browser or a mobile app.
Block servers: block servers upload blocks to cloud storage. Block storage, also referred to as block-level storage, is a technology to store data files in cloud-based environments. A file can be split into several blocks, each with a unique hash value stored in our metadata database. Each block is treated as an independent object and stored in our storage system (S3). To reconstruct a file, blocks are joined in a particular order. As for the block size, we use Dropbox as a reference: it sets the maximum size of a block to 4 MB [6].
Cloud storage: a file is split into smaller blocks and stored in cloud storage.
Cold storage: a computer system designed for storing inactive data, meaning files that are not accessed for a long time.
Load balancer: a load balancer evenly distributes requests among API servers.
API servers: these are responsible for almost everything other than the uploading flow. API servers are used for user authentication, managing user profiles, updating file metadata, etc.
Metadata database: it stores metadata of users, files, blocks, versions, etc. Please note that files are stored in the cloud and the metadata database only contains metadata.
Metadata cache: some of the metadata is cached for fast retrieval.
Notification service: a publisher/subscriber system that allows data to be transferred from the notification service to clients as certain events happen. In our specific case, the notification service notifies relevant clients when a file is added, edited, or removed elsewhere so they can pull the latest changes.
Offline backup queue: if a client is offline and cannot pull the latest file changes, the offline backup queue stores the info so changes will be synced when the client comes online.
We have discussed the design of Google Drive at a high level. Some of the components are complicated and worth careful examination; we will discuss those in detail in the deep dive.
In this section, we will take a close look at the following: block servers, metadata database, upload flow, download flow, notification service, saving storage space, and failure handling.
For large files that are updated regularly, sending the whole file on each update consumes a lot of bandwidth. Two optimizations are proposed to minimize the amount of network traffic being transmitted:
•Delta sync. When a file is modified, only the modified blocks are synced instead of the whole file, using a sync algorithm [7] [8].
•Compression. Applying compression on blocks can significantly reduce the data size. Thus, blocks are compressed using compression algorithms depending on file types. For example, gzip and bzip2 are used to compress text files. Different compression algorithms are needed to compress images and videos.
In our system, block servers do the heavy lifting for uploading files. Block servers process files passed from clients by splitting a file into blocks, compressing each block, and encrypting them. Instead of uploading the whole file to the storage system, only the modified blocks are transferred.
Figure 15-11 shows how a block server works when a new file is added.
•A file is split into smaller blocks.
•Each block is compressed using compression algorithms.
•To ensure security, each block is encrypted before it is sent to cloud storage.
•Blocks are uploaded to the cloud storage.
Figure 15-12 illustrates delta sync, meaning only modified blocks are transferred to cloud storage. The highlighted blocks, “block 2” and “block 5”, represent the changed blocks. Using delta sync, only those two blocks are uploaded to the cloud storage.
Block servers allow us to save network traffic by providing delta sync and compression.
Our system requires strong consistency by default. It is unacceptable for a file to be shown differently by different clients at the same time. The system needs to provide strong consistency for metadata cache and database layers.
Memory caches adopt an eventual consistency model by default, which means different replicas might have different data. To achieve strong consistency, we must ensure the following:
•Data in cache replicas and the master is consistent.
•Invalidate caches on database write to ensure cache and database hold the same value.
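A minimal sketch of the cache-invalidation rule, using plain dictionaries to stand in for the memory cache and the metadata database (both assumptions for illustration):

```python
class MetadataStore:
    """Invalidate the cache entry on every DB write so the cache
    never serves a value that disagrees with the database."""

    def __init__(self):
        self.db = {}      # stands in for the relational metadata DB
        self.cache = {}   # stands in for the memory cache

    def read(self, key):
        if key not in self.cache and key in self.db:
            self.cache[key] = self.db[key]   # cache miss: load from DB
        return self.cache.get(key)

    def write(self, key, value):
        self.db[key] = value                 # 1. write to the database
        self.cache.pop(key, None)            # 2. invalidate the cached copy
```

The next read after a write misses the cache and reloads from the database, so the cache and the database always agree on what a reader observes.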
Achieving strong consistency in a relational database is easy because it maintains ACID (Atomicity, Consistency, Isolation, Durability) properties [9]. However, NoSQL databases do not support ACID properties by default. ACID properties must be programmatically incorporated into the synchronization logic. In our design, we choose relational databases because ACID is natively supported.
Figure 15-13 shows the database schema design. Please note this is a highly simplified version as it only includes the most important tables and interesting fields.
User: The user table contains basic information about the user such as username, email, profile photo, etc.
Device: The device table stores device info. Push_id is used for sending and receiving mobile push notifications. Please note a user can have multiple devices.
Namespace: A namespace is the root directory of a user.
File: The file table stores everything related to the latest file.
File_version: It stores the version history of a file. Existing rows are read-only to keep the integrity of the file revision history.
Block: It stores everything related to a file block. A file of any version can be reconstructed by joining all the blocks in the correct order.
Let us discuss what happens when a client uploads a file. To better understand the flow, we draw the sequence diagram as shown in Figure 15-14.
In Figure 15-14, two requests are sent in parallel: add file metadata and upload the file to cloud storage. Both requests originate from client 1.
•Add file metadata.
1. Client 1 sends a request to add the metadata of the new file.
2. Store the new file metadata in the metadata DB and change the file upload status to “pending.”
3. Notify the notification service that a new file is being added.
4. The notification service notifies relevant clients (client 2) that a file is being uploaded.
•Upload files to cloud storage.
2.1 Client 1 uploads the content of the file to block servers.
2.2 Block servers chunk the file into blocks, compress and encrypt the blocks, and upload them to cloud storage.
2.3 Once the file is uploaded, cloud storage triggers an upload completion callback. The request is sent to API servers.
2.4 The file status is changed to “uploaded” in the metadata DB.
2.5 Notify the notification service that the file status is changed to “uploaded.”
2.6 The notification service notifies relevant clients (client 2) that the file is fully uploaded.
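The status transitions in the flow above can be sketched as follows. A dictionary stands in for the metadata DB and a `notify` callable stands in for the notification service (both are illustrative assumptions); the comments mark the corresponding steps in Figure 15-14.

```python
def upload_flow(metadata_db, notify):
    """Sketch of the metadata status transition during an upload:
    the file is recorded as "pending" and flipped to "uploaded" when
    cloud storage fires its completion callback."""
    file_id = "file-1"
    # Steps 1-4: add file metadata with status "pending", notify clients.
    metadata_db[file_id] = {"status": "pending"}
    notify(f"{file_id} is being uploaded")
    # Steps 2.1-2.2 (in parallel): blocks are uploaded to cloud storage.
    # Steps 2.3-2.6: completion callback updates status, clients notified.
    metadata_db[file_id]["status"] = "uploaded"
    notify(f"{file_id} is fully uploaded")
    return metadata_db[file_id]["status"]
```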
When a file is edited, the flow is similar, so we will not repeat it.
The download flow is triggered when a file is added or edited elsewhere. How does a client know if a file is added or edited by another client? There are two ways a client can know:
•If client A is online while a file is changed by another client, the notification service will inform client A that changes are made somewhere so it needs to pull the latest data.
•If client A is offline while a file is changed by another client, data will be saved to the cache. When the offline client is online again, it pulls the latest changes.
Once a client knows a file is changed, it first requests metadata via API servers, then downloads blocks to construct the file. Figure 15-15 shows the detailed flow. Note that only the most important components are shown in the diagram due to space constraints.
1. Notification service informs client 2 that a file is changed somewhere else.
2. Once client 2 knows that new updates are available, it sends a request to fetch metadata.
3. API servers call metadata DB to fetch metadata of the changes.
4. Metadata is returned to the API servers.
5. Client 2 gets the metadata.
6. Once the client receives the metadata, it sends requests to block servers to download blocks.
7. Block servers first download blocks from cloud storage.
8. Cloud storage returns blocks to the block servers.
9. Client 2 downloads all the new blocks to reconstruct the file.
To maintain file consistency, other clients need to be informed of any mutation of a file performed locally, to reduce conflicts. The notification service is built to serve this purpose. At a high level, the notification service allows data to be transferred to clients as events happen. Here are a few options:
•Long polling. Dropbox uses long polling [10].
•WebSocket. WebSocket provides a persistent connection between the client and the server. Communication is bi-directional.
Even though both options work well, we opt for long polling for the following two reasons:
•Communication for the notification service is not bi-directional. The server sends information about file changes to the client, but not vice versa.
•WebSocket is suited for real-time bi-directional communication such as a chat app. For Google Drive, notifications are sent infrequently with no burst of data.
With long polling, each client establishes a long poll connection to the notification service. If changes to a file are detected, the client will close the long poll connection. Closing the connection means a client must connect to the metadata server to download the latest changes. After a response is received or a connection timeout is reached, a client immediately sends a new request to keep the connection open.
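The client side of this loop can be sketched as follows. Here `poll` stands in for the blocking HTTP request to the notification service, which returns either "changed" or "timeout" when the server-side hold expires, and `apply_changes` stands in for fetching metadata and downloading blocks (both names are illustrative assumptions).

```python
def long_poll_loop(poll, apply_changes, max_rounds=3):
    """Sketch of a long-polling client: after every response or timeout,
    immediately issue a new request so a connection is always open."""
    for _ in range(max_rounds):              # a real client loops forever
        result = poll()                      # blocks until change or timeout
        if result == "changed":
            apply_changes()                  # fetch metadata, download blocks
        # on "timeout", simply loop around and reconnect
```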
To support file version history and ensure reliability, multiple versions of the same file are stored across multiple data centers. Storage space can be filled up quickly with frequent backups of all file revisions. Three techniques are proposed to reduce storage costs:
•De-duplicate data blocks. Eliminating redundant blocks at the account level is an easy way to save space. Two blocks are identical if they have the same hash value.
•Adopt an intelligent data backup strategy. Two optimization strategies can be applied:
•Set a limit: We can set a limit for the number of versions to store. If the limit is reached, the oldest version is replaced with the new version.
•Keep valuable versions only: Some files might be edited frequently. For example, saving every edited version of a heavily modified document could mean the file is saved over 1,000 times within a short period. To avoid unnecessary copies, we could limit the number of saved versions. We give more weight to recent versions. Experimentation is helpful to figure out the optimal number of versions to save.
•Move infrequently used data to cold storage. Cold data is data that has not been active for months or years. Cold storage like Amazon S3 Glacier [11] is much cheaper than S3.
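De-duplication by hash can be sketched as a content-addressed store: a block's storage key is the hash of its contents, so identical blocks are stored only once. SHA-256 here is an assumption; any strong cryptographic hash would do.

```python
import hashlib

class BlockStore:
    """Content-addressed block storage: identical blocks within an
    account map to the same key and are stored only once."""

    def __init__(self):
        self.blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:           # duplicate blocks are skipped
            self.blocks[key] = data
        return key
```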
Failures can occur in a large-scale system, and we must adopt design strategies to address them. Your interviewer might be interested in hearing how you handle the following system failures:
•Load balancer failure: If a load balancer fails, the secondary would become active and pick up the traffic. Load balancers usually monitor each other using a heartbeat, a periodic signal sent between load balancers. A load balancer is considered failed if it has not sent a heartbeat for some time.
•Block server failure: If a block server fails, other servers pick up unfinished or pending jobs.
•Cloud storage failure: S3 buckets are replicated multiple times in different regions. If files are not available in one region, they can be fetched from different regions.
•API server failure: It is a stateless service. If an API server fails, the traffic is redirected to other API servers by a load balancer.
•Metadata cache failure: Metadata cache servers are replicated multiple times. If one node goes down, you can still access other nodes to fetch data. We will bring up a new cache server to replace the failed one.
•Metadata DB failure.
•Master down: If the master is down, promote one of the slaves to act as the new master and bring up a new slave node.
•Slave down: If a slave is down, you can use another slave for read operations and bring up another database server to replace the failed one.
•Notification service failure: Every online user keeps a long poll connection with the notification server. Thus, each notification server is connected with many users. According to the Dropbox talk in 2012 [6], over 1 million connections are open per machine. If a server goes down, all the long poll connections are lost, so clients must reconnect to a different server. Even though one server can keep many open connections, it cannot reconnect all the lost connections at once. Reconnecting with all the lost clients is a relatively slow process.
•Offline backup queue failure: Queues are replicated multiple times. If one queue fails, consumers of the queue may need to re-subscribe to the backup queue.
In this chapter, we proposed a system design to support Google Drive. The combination of strong consistency, low network bandwidth, and fast sync makes the design interesting. Our design contains two flows: manage file metadata and file sync. The notification service is another important component of the system. It uses long polling to keep clients up to date with file changes.
Like any system design interview question, there is no perfect solution. Every company has its unique constraints, and you must design a system to fit those constraints. Knowing the tradeoffs of your design and technology choices is important. If there are a few minutes left, you can talk about different design choices.
For example, we can upload files directly to cloud storage from the client instead of going through block servers. The advantage of this approach is that it makes file upload faster because a file only needs to be transferred once to the cloud storage. In our design, a file is transferred to block servers first, and then to the cloud storage. However, the new approach has a few drawbacks:
•First, the same chunking, compression, and encryption logic must be implemented on different platforms (iOS, Android, Web). It is error-prone and requires a lot of engineering effort. In our design, all of this logic is implemented in one centralized place: block servers.
•Second, as a client can easily be hacked or manipulated, implementing encryption logic on the client side is not ideal.
Another interesting evolution of the system is moving online/offline logic to a separate service. Let us call it the presence service. By moving the presence service out of the notification servers, online/offline functionality can easily be integrated by other services.
Congratulations on getting this far! Now give yourself a pat on the back. Good job!
Reference materials
[1] Google Drive: https://www.google.com/drive/
[2] Upload file data: https://developers.google.com/drive/api/v2/manage-uploads
[3] Amazon S3: https://aws.amazon.com/s3
[4] Differential Synchronization: https://neil.fraser.name/writing/sync/
[5] Differential Synchronization youtube talk: https://www.youtube.com/watch?v=S2Hp_1jqpY8
[6] How We’ve Scaled Dropbox: https://youtu.be/PE4gwstWhmc
[7] Tridgell, A., & Mackerras, P. (1996). The rsync algorithm.
[8] Librsync. (n.d.). Retrieved April 18, 2015, from https://github.com/librsync/librsync
[9] ACID: https://en.wikipedia.org/wiki/ACID
[10] Dropbox security white paper: https://www.dropbox.com/static/business/resources/Security_Whitepaper.pdf
[11] Amazon S3 Glacier: https://aws.amazon.com/glacier/faqs/
Designing good systems requires years of accumulation of knowledge. One shortcut is to dive into real-world system architectures. Below is a collection of helpful reading materials. We highly recommend you pay attention to both the shared principles and the underlying technologies. Researching each technology and understanding what problems it solves is a great way to strengthen your knowledge base and refine the design process.
The following materials can help you understand general design ideas of real system architectures behind different companies.
Facebook Timeline: Brought To You By The Power Of Denormalization: https://goo.gl/FCNrbm
Scale at Facebook: https://goo.gl/NGTdCs
Building Timeline: Scaling up to hold your life story: https://goo.gl/8p5wDV
Erlang at Facebook (Facebook chat): https://goo.gl/zSLHrj
Facebook Chat: https://goo.gl/qzSiWC
Finding a needle in Haystack: Facebook’s photo storage: https://goo.gl/edj4FL
Serving Facebook Multifeed: Efficiency, performance gains through redesign: https://goo.gl/adFVMQ
Scaling Memcache at Facebook: https://goo.gl/rZiAhX
TAO: Facebook’s Distributed Data Store for the Social Graph: https://goo.gl/Tk1DyH
Amazon Architecture: https://goo.gl/k4feoW
Dynamo: Amazon’s Highly Available Key-value Store: https://goo.gl/C7zxDL
A 360 Degree View Of The Entire Netflix Stack: https://goo.gl/rYSDTz
It’s All A/Bout Testing: The Netflix Experimentation Platform: https://goo.gl/agbA4K
Netflix Recommendations: Beyond the 5 stars (Part 1): https://goo.gl/A4FkYi
Netflix Recommendations: Beyond the 5 stars (Part 2): https://goo.gl/XNPMXm
Google Architecture: https://goo.gl/dvkDiY
The Google File System (Google Docs): https://goo.gl/xj5n9R
Differential Synchronization (Google Docs): https://goo.gl/9zqG7x
YouTube Architecture: https://goo.gl/mCPRUF
Seattle Conference on Scalability: YouTube Scalability: https://goo.gl/dH3zYq
Bigtable: A Distributed Storage System for Structured Data: https://goo.gl/6NaZca
Instagram Architecture: 14 Million Users, Terabytes Of Photos, 100s Of Instances, Dozens Of Technologies: https://goo.gl/s1VcW5
The Architecture Twitter Uses To Deal With 150M Active Users: https://goo.gl/EwvfRd
Scaling Twitter: Making Twitter 10000 Percent Faster: https://goo.gl/nYGC1k
Announcing Snowflake (Snowflake is a network service for generating unique ID numbers at high scale with some simple guarantees): https://goo.gl/GzVWYm
Timelines at Scale: https://goo.gl/8KbqTy
How Uber Scales Their Real-Time Market Platform: https://goo.gl/kGZuVy
Scaling Pinterest: https://goo.gl/KtmjW3
Pinterest Architecture Update: https://goo.gl/w6rRsf
A Brief History of Scaling LinkedIn: https://goo.gl/8A1Pi8
Flickr Architecture: https://goo.gl/dWtgYa
How We've Scaled Dropbox: https://goo.gl/NjBDtC
The WhatsApp Architecture Facebook Bought For $19 Billion: https://bit.ly/2AHJnFn
If you are going to interview with a company, it is a great idea to read their engineering blogs and get familiar with technologies and systems adopted and implemented there. Besides, engineering blogs provide invaluable insights about certain fields. Reading them regularly could help us become better engineers.
Here is a list of engineering blogs of well-known large companies and startups.
Airbnb: https://medium.com/airbnb-engineering
Amazon: https://developer.amazon.com/blogs
Asana: https://blog.asana.com/category/eng
Atlassian: https://developer.atlassian.com/blog
Bittorrent: http://engineering.bittorrent.com
Cloudera: https://blog.cloudera.com
Docker: https://blog.docker.com
Dropbox: https://blogs.dropbox.com/tech
eBay: http://www.ebaytechblog.com
Facebook: https://code.facebook.com/posts
GitHub: https://githubengineering.com
Google: https://developers.googleblog.com
Groupon: https://engineering.groupon.com
Highscalability: http://highscalability.com
Instacart: https://tech.instacart.com
Instagram: https://engineering.instagram.com
Linkedin: https://engineering.linkedin.com/blog
Mixpanel: https://mixpanel.com/blog
Netflix: https://medium.com/netflix-techblog
Nextdoor: https://engblog.nextdoor.com
PayPal: https://www.paypal-engineering.com
Pinterest: https://engineering.pinterest.com
Quora: https://engineering.quora.com
Reddit: https://redditblog.com
Salesforce: https://developer.salesforce.com/blogs/engineering
Shopify: https://engineering.shopify.com
Slack: https://slack.engineering
Soundcloud: https://developers.soundcloud.com/blog
Spotify: https://labs.spotify.com
Stripe: https://stripe.com/blog/engineering
System design primer: https://github.com/donnemartin/system-design-primer
Twitter: https://blog.twitter.com/engineering/en_us.html
Thumbtack: https://www.thumbtack.com/engineering
Uber: http://eng.uber.com
Yahoo: https://yahooeng.tumblr.com
Yelp: https://engineeringblog.yelp.com
Zoom: https://medium.com/zoom-developer-blog
Software engineering interviews are challenging, but the good news is that the right preparation can make a big difference. A technical interview usually covers one of these areas: coding, system design, or object-oriented design. To help you land a dream job, we put together a list of books that might be helpful.
Understanding Distributed Systems by Roberto Vitillo
This book teaches the fundamentals of distributed systems. The author does an excellent job explaining the network stack, data consistency models, resilience, scalability and reliability patterns, and much more. Link: http://bit.ly/dissystems
The Tech Resume Inside-Out by Gergely Orosz
A strong resume is your ticket to standing out among many competing candidates. This book’s content is well researched and aims to help you craft a professional-looking resume. Best part: the author reached out to dozens of experienced technical recruiters and hands-on hiring managers to make sure the book is useful and factually correct. Full disclosure: I know him personally. Link: https://bit.ly/3lRLWXh
Designing Data-Intensive Applications by Martin Kleppmann
This classic 616-page book is considered a must-read for aspiring engineers who work on distributed systems. Best part: this book is very technical and contains a lot of in-depth discussion about scalability, consistency, reliability, efficiency, and maintainability. Link: https://amzn.to/2K0PLfq
Disclaimer: I only recommend books that I have read. This section contains affiliate links. If you use these links to buy something, we may earn a small commission. Thanks.
Congratulations! You are at the end of this interview guide. You have accumulated the skills and knowledge to design systems. Not everyone has the discipline to learn what you have learned. Take a moment and pat yourself on the back. Your hard work will pay off.
Landing a dream job is a long journey and requires lots of time and effort. Practice makes perfect. Good luck!
Thank you for buying and reading this book. Without readers like you, our work would not exist. We hope you have enjoyed the book!
If you don’t mind, please review this book on Amazon: http://bit.ly/sysreview8. It would help me attract more wonderful readers like you.
Join the Email List
We are getting close to finishing more than 10 real-world system design interview questions. Please subscribe to our email list if you want to be notified when new chapters are available: http://bit.ly/systemmail
Join the community
I created a member-only Discord group. It is designed for community discussions on the following topics:
•System design fundamentals.
•Showcase design diagrams and get feedback.
•Find mock interview buddies.
•General chat with community members.
Come join us and introduce yourself to the community today by clicking the link below or scanning the barcode.
If you have comments or questions about this book, feel free to send us an email at systemdesigninsider@gmail.com. Besides, if you notice any errors, please let us know so we can make corrections in the next edition. Thank you!